Abstract: Over discrete memoryless channels (DMC), linear
decoders (maximizing additive metrics) afford several nice properties.
In particular, if suitable encoders are employed, they permit the use of
decoding algorithms with manageable complexity.
Maximum likelihood is an example of a linear decoder. For a compound
DMC, decoders that perform well without knowledge of the channel
are required in order to achieve capacity. Several such
decoders have been studied in the literature. However, no such
decoder is known to be linear. Hence, the problem of
finding linear decoders achieving capacity for compound DMCs
is addressed, and it is shown that under minor concessions, such
decoders exist and can be constructed.

This paper also develops a local geometric analysis, which allows,
in particular, the above problem to be solved. By considering very
noisy channels, the original problem is reduced, in the limit, to an
inner product space problem, for which insightful solutions can
be found. The local setting can then provide counterexamples to
disprove claims, but it is also shown how, in this problem, results
proven locally can be "lifted" to results proven globally.

I. Introduction 

We consider a discrete memoryless channel with input
alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$. The channel is described by
the probability transition matrix $W$, each row of which is the
conditional distribution of the output symbol $Y$ conditioned
on a particular input $X = x \in \mathcal{X}$. We are interested in the
compound channel, where the exact value of $W$ is not known,
either at the transmitter or the receiver. Such problems can
often be motivated by wireless applications with unknown
fading realizations. Here, instead of assuming the channel $W$
to be known at the receiver and transmitter, we assume that
a set $S$ of possible channels is known at the receiver and
transmitter; and our goal is to design encoders and decoders
that support reliable communication, no matter which channel
in $S$ actually takes place.

Compound channels have been extensively studied in the
literature. In particular, Blackwell et al. [2] showed that the
highest achievable rate is given by the following expression:

$$C(S) = \max_{P} \inf_{W \in S} I(P, W), \qquad (1)$$

where the maximization is over all probability distributions $P$
on $\mathcal{X}$. Thus, $C(S)$ is referred to as the compound channel
capacity. To achieve the capacity, i.i.d. (or fixed composition)
random codes from the optimal input distribution, i.e. the
distribution maximizing (1), are used. The random coding
argument is commonly employed to prove achievability for a
single given channel, such as in Shannon's original paper. By
showing that the error probability averaged over the random
ensemble can be made arbitrarily small, one can conclude that
there exist "good" codes with low enough error probability.
This argument is strengthened in [2] to show that with the
random coding argument, we can indeed prove the existence
of codes that are good for all possible channels. Adopting this
view, in this paper, we will not be concerned with constructing
the code, or even finding the optimal input distribution,
but rather simply assume that one of the above mentioned
universally good codes is used, and focus on the design of
efficient decoding algorithms.

In [2], a decoder that maximizes a uniform mixture of
likelihoods over most possible channels is used, and shown
to achieve capacity. The most general universal decoder is
the maximum mutual information (MMI) decoder [4], which
computes the empirical mutual information between each
codeword and the received word and picks the highest one. The
practical difficulty of implementing MMI decoders is obvious.
As empirical distributions are used in computing the "score" of
each codeword, it becomes challenging to efficiently store the
exponentially many scores, and to update the scores as symbols
are received sequentially. Conceptually, when the empirical
distribution of the received signals is computed, one can in
principle estimate the channel $W$, making the assumption of
a lack of channel knowledge less meaningful. There have been
a number of different universal decoders proposed in the
literature, including the LZ based algorithm [10] and the merged
likelihood decoder [6]. In this paper, we try to find universal
decoders in a class of particularly simple decoders: linear
decoders.

Here, linear (or additive) decoders are defined to have the
following structure. Upon receiving the $n$-symbol word $y$, the
decoder computes a score/decoding metric $d^n(x_m, y)$ (note that
the score of a codeword does not depend on other codewords)
for each codeword $x_m$, $m = 1, 2, \ldots, 2^{nR}$, and decodes to
the one codeword with the highest score (ties can be resolved
arbitrarily). Moreover, the $n$-symbol decoding metric has the
following additive structure

$$d^n(x_m, y) = \sum_{i=1}^{n} d(x_m(i), y(i)),$$

where $d : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a (single-letter) decoding metric.
Such decoders are called linear since the decoding metric they
compute is indeed linear in the joint empirical distribution
between the codeword and the received word, since

$$d^n(x_m, y) = n \cdot \sum_{a \in \mathcal{X}, b \in \mathcal{Y}} \hat{P}_{(x_m, y)}(a, b) \cdot d(a, b),$$

where $\hat{P}_{(x_m, y)}$ denotes the joint empirical distribution of
$(x_m, y)$. We call such a decoder a linear decoder induced by
the single-letter metric $d$.
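The additive structure above can be illustrated with a minimal sketch. The BSC crossover probability, the three-word codebook and the received word below are hypothetical toy values chosen for illustration only; the function names are not from the paper. The sketch checks numerically that the sum of single-letter metrics equals $n$ times the expectation of $d$ under the joint empirical distribution.

```python
from collections import Counter
import math

def additive_score(d, x, y):
    """Sum of single-letter metrics d[(a, b)] along the pair (x, y)."""
    return sum(d[(a, b)] for a, b in zip(x, y))

def empirical_joint(x, y):
    """Joint empirical distribution of the pair (x, y)."""
    n = len(x)
    return {pair: c / n for pair, c in Counter(zip(x, y)).items()}

def linear_decode(d, codebook, y):
    """Index of the codeword with the highest additive score (first wins ties)."""
    scores = [additive_score(d, x, y) for x in codebook]
    return max(range(len(codebook)), key=lambda m: scores[m])

# Hypothetical toy setup: log-likelihood metric of a BSC with crossover 0.1.
p = 0.1
d_ml = {(a, b): math.log(1 - p if a == b else p) for a in (0, 1) for b in (0, 1)}
codebook = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1), (0, 1, 0, 1, 0)]
y = (1, 1, 0, 1, 1)

score = additive_score(d_ml, codebook[1], y)
P_hat = empirical_joint(codebook[1], y)
# Same score, computed as n times the expectation of d under the empirical distribution.
via_type = len(y) * sum(P_hat[pair] * d_ml[pair] for pair in P_hat)
decoded = linear_decode(d_ml, codebook, y)
```

Because the score depends on $(x_m, y)$ only through a sum over positions, it can be accumulated symbol by symbol, which is what makes trellis-based decoders applicable.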

Linear decoders have been widely studied in [5], [11]. An
additive decoding metric has some obvious advantages. First,
when used with appropriate codes, it allows the decoding
complexity to be reduced. Note that the maximum likelihood (ML)
decoder is by definition a linear decoder, with single-letter
metric $d = \log W$, the log likelihood of the channel; thus linear
decoders can potentially use the existing decoder structures
to simplify designs. For example, when convolutional codes
are used, the Viterbi algorithm can be used, with the path weight
calculation changed from the log likelihood of a specific channel
to a new metric designed for a compound set. Moreover,
additive structures are also suitable for belief propagation
algorithms. It is worth clarifying that the complexity reduction
discussed here relies on certain structured codes being used
in place of the random codes. In this paper, however,
our analysis will be based on the random coding argument,
with the implicit conjecture that there exist structured codes
resembling the behavior of random codes under linear decoding.
Mathematically, as observed in [5], [11], linear decoders
are also more interesting in that the geometric structure of
decoders is revealed, allowing the effects of "mismatched"
decoders to be understood with engineering insights.

It is not surprising that for some compound channels, a
linear universal decoder does not exist. In [5], [11], it is shown
that $S$ being convex and compact is a sufficient condition for
the existence of linear universal decoders. In this paper, we
give a more general sufficient condition for a set to admit
a capacity achieving linear decoder, namely that $S$ is one-sided,
following some geometric argument that will be made
clear later. For more general compound sets, in order to
achieve the capacity, we have to resort to a relaxed restriction
on the decoders, which we call generalized linear decoders.
A generalized linear decoder, for example, the well-known
generalized log likelihood ratio test (GLRT), maximizes a finite
number, $K$, of decoding metrics, $d_1, d_2, \ldots, d_K$. The decoding
map can then be written as

$$\arg\max_m \vee_{k=1}^{K} d_k^n(x_m, y) = \arg\max_m \vee_{k=1}^{K} \sum_{i=1}^{n} d_k(x_m(i), y(i)).$$

Here, the receiver calculates in parallel $K$ additive metrics for
each codeword, and decodes to the codeword with the highest
among the total $2^{nR} \times K$ scores. In order for such a generalized
linear decoder to have a manageable complexity, we emphasize
the restriction that $K$ has to be finite. In particular, it should
not increase with the codeword length $n$. For example, the
decoder proposed in [2], a mixture of likelihoods over all
possible channels, in general might require averaging over
polynomially many (in $n$) channels. In addition, optimizing the mixture of
additive metrics, i.e. $\arg\max_m \sum_{k=1}^{K} d_k^n(x_m, y)$, cannot be
solved by computing $K$ parallel additive metric optimizations:
the codewords having the best scores for each of the $K$ metrics
may not be the only candidates for the best score of the mixture
of the metrics; on the other hand, if we consider a generalized
linear decoder, the codewords having the best score for each
of the $K$ metrics are the only ones to be considered for the
maximum of the $K$ metrics.
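The difference between maximizing the maximum of $K$ metrics and maximizing their mixture can be seen on a small table of scores. The following sketch uses hypothetical scores for three codewords and $K = 2$ metrics (the numbers and function names are illustrative assumptions): the max-of-metrics winner is always one of the per-metric winners, while the mixture winner need not be.

```python
def generalized_linear_decode(score_table):
    """score_table[m][k] holds the additive score of codeword m under metric k.
    Decode to the codeword whose best score over the K metrics is largest."""
    return max(range(len(score_table)), key=lambda m: max(score_table[m]))

def mixture_decode(score_table):
    """Decode to the codeword with the largest sum of the K scores."""
    return max(range(len(score_table)), key=lambda m: sum(score_table[m]))

def per_metric_winners(score_table):
    """Set of codewords that win at least one of the K single-metric searches."""
    K = len(score_table[0])
    return {max(range(len(score_table)), key=lambda m: score_table[m][k])
            for k in range(K)}

# Hypothetical scores for 3 codewords and K = 2 metrics.
scores = [
    [10.0, 0.0],   # codeword 0 wins metric 0
    [0.0, 10.0],   # codeword 1 wins metric 1
    [9.0, 9.0],    # codeword 2 wins neither metric but has the largest sum
]
winners = per_metric_winners(scores)
gld = generalized_linear_decode(scores)
mix = mixture_decode(scores)
```

In this toy table the generalized linear decoder only needs the outcomes of the $K$ parallel searches, whereas the mixture picks a codeword that none of those searches returns.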

The main result of this paper is the construction of generalized
linear decoders that achieve compound channel capacity
on most compound sets. As will be shown in Section II,
this construction requires solving some rather complicated
optimization problems involving the Kullback-Leibler (KL)
divergence (like almost every other information theoretical
problem). To obtain insights into this problem, we introduce
in Section III a special tool: local geometric analysis. In
a nutshell, we focus on the special cases where the two
distributions in the KL divergence are "close" to each other,
which can be thought of in this context as approximating the
given compound channels by very noisy channels. In this
local setting, information theoretical quantities can be naturally
understood as quantities in an inner product space, where
conditional distributions and decoding metrics correspond to
vectors, divergence and mutual information correspond to
squared norms, and the data rate with mismatched linear
decoders can be understood with projections. The relation
between these quantities can thus be understood intuitively.
While the results from such local approximations only apply
to the special very noisy cases, we show in Section V that some
of these results can be "lifted" to the naturally corresponding
statements about general cases. Using this approach, we derive
the following main results of the paper:

• First, we derive a new condition on $S$ to be "one-sided",
cf. Definition 4, under which a linear decoder, which
decodes using the log likelihood of the worst channel
over the compound set, achieves capacity. This condition
is more general than the previously known one, which
requires $S$ to be convex;

• Then, we show in our main result that if the compound
set $S$ can be written as a finite union of one-sided
sets, then a generalized linear decoder using the log a
posteriori distribution of the worst channels of each one-sided
subset achieves the compound capacity; in contrast,
GLRT using these worst channels is not a universal
decoder.

Besides the specific results on the compound channels, we
would also like to emphasize the use of the local geometric analysis.
As most multi-terminal information theory problems
involve optimizations of K-L divergences, often between distributions
with high dimensionality, we believe the localization
method used in this paper can be a generic tool to simplify
these problems. Focusing on certain special cases, this method
is obviously useful in providing counterexamples to disprove
conjectures. However, we also hope to convince the readers
that the insights provided by the geometric analysis can
also be valuable in solving the general problem. For example,
our definition of one-sided sets and the use of log a posteriori
distributions as decoding metrics can be seen as "naturally"
suggested by the local analysis.

In the next section, we will start with the precise problem 
formulations and notations. 

II. Linearity and Universality 

We consider discrete memoryless channels with input and
output alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively. The channel is often
written as a probability transition matrix, $W$, of dimension
$|\mathcal{X}| \times |\mathcal{Y}|$, each row of which denotes the conditional distribution
of the output, conditioned on a specific value of the
input. We are interested in the compound channel, where $W$
can be any element of a given set $S$, referred to as the set
of possible channels, or the compound set. For convenience,
we assume $S$ to be compact. The value of the true channel is
assumed to be fixed for the entire duration of communications,
but not known to either the transmitter or the receiver; only
the compound set $S$ is assumed to be known at both.

We assume that the transmitter and the receiver operate
synchronously over blocks of $n$ symbols. In each block, a
data message $m \in \{1, 2, \ldots, 2^{nR}\}$ is mapped by an encoder

$$F_n : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n$$

to $F_n(m) = x_m \in \mathcal{X}^n$, referred to as the $m$-th codeword.
The receiver observes the received word, drawn from the
distribution

$$W^n(y | x_m) = \prod_{i=1}^{n} W(y(i) | x_m(i)),$$

and applies a decoding map

$$G_n : \mathcal{Y}^n \to \{1, 2, \ldots, 2^{nR}\}.$$

The average probability of error, averaged over a given code
$(F_n, G_n)$, for a specific channel $W$, is written as

$$P_e(F_n, G_n, W) = \frac{1}{2^{nR}} \sum_{m=1}^{2^{nR}} \sum_{y : G_n(y) \neq m} W^n(y | x_m).$$

A rate $R$ is said to be achievable for the given compound set
$S$ iff for any $\epsilon > 0$, there exists a large enough block length
$n$, and $(F_n, G_n)$ with rate at least $R$, such that for all $W \in S$,
$P_e(F_n, G_n, W) < \epsilon$. The supremum of such achievable rates
is called the compound channel capacity, written as $C(S)$.
The following result from Blackwell et al. gives the compound
channel capacity in general.



Lemma 1: Compound Channel Capacity [2]

$$C(S) = \max_{P_X} \inf_{W \in S} I(P_X, W). \qquad (2)$$

Remark: The random coding argument is often used in proving
the coding theorem for a fixed channel. By showing that the
error probability, averaged over the ensemble of random codes,
approaches 0 as $n$ increases, one can draw the conclusion
that there exists at least one sequence of codes for which
the probability of error, averaged over the specific codes,
is driven to 0. A similar argument is used in compound
channels. Here, however, it is not enough to show that the
ensemble average error probability is small for every $W$. Since
the "good" codes for different channels can in principle be
different, this is not enough to guarantee the existence of a
single code that is universally good for all possible channels.
The random coding argument is strengthened in [2] to show
that universally good codes indeed exist. The approach used
in [2], to show that the sets of good codes corresponding to
every possible channel have non-empty intersection, has been
used as a standard method to study compound channels. In this
paper, we focus on designing efficient decoders, which
is interesting since the optimal maximum likelihood decoder
cannot be used when the channel law is unknown. We will not be
particularly concerned with finding a good codebook, or even
the optimal input distribution. To simplify our discussions, we
will, for most of our results, only show that the ensemble
average error probability can be made small when the decoders
discussed in the paper are used. Arguments similar to that of
[2] can be used to show that the error probability can be made
small when appropriately chosen codes are used.

Now before we proceed to define decoders, we need to
define some notation:

• We always assume that we are working with the optimal
input distribution $P_X$ for the considered compound set
$S$, i.e.

$$P_X = \arg\max_{P} \inf_{W \in S} I(P, W)$$

(if the maximizer is not unique, we pick
one arbitrarily). Therefore, $\inf_{W \in S} I(P_X, W)$
is the compound channel capacity for a compound set
$S$. However, the results in this paper can be stated for
arbitrary input distributions (not necessarily optimal); the
only difference would then be that we would talk about
mutual informations instead of capacities.

• For convenience, we assume that $S$ is compact. We define
$W_S = \arg\min_{W \in S} I(P_X, W)$, and call it the worst
channel of $S$ when the minimizer is unique; $I(P_X, W_S)$
is then the compound channel capacity for a compound
set $S$. We make the convention that each time a worst
channel is considered throughout the paper for any set,
the set in question is compact.

• $W_0 \in S$ denotes the true channel;

• For a joint distribution $\mu$ on $\mathcal{X} \times \mathcal{Y}$, $\mu_X$ and $\mu_Y$ denote
respectively the $X$ and $Y$ marginal distributions, and
$\mu^p = \mu_X \times \mu_Y$ the induced product distribution. Note
that $\{\mu_X = P_X, \mu_Y = (\mu_0)_Y\} \Rightarrow \mu^p = \mu_0^p$.

• $\mu = P_X \circ W$ denotes the joint distribution with $P_X$ as
the $X$ marginal distribution and $W$ as the conditional
distribution. For example, the mutual information

$$I(P_X, W) = D(P_X \circ W \| (P_X \circ W)^p),$$

where $D(\cdot \| \cdot)$ is the Kullback-Leibler divergence.
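As a minimal numerical sketch of the notation above, the following code evaluates $I(P_X, W)$ as the divergence between the joint distribution $P_X \circ W$ and the product of its marginals, and then approximates $\max_P \min_{W \in S} I(P, W)$ by a grid search. The compound set of two binary symmetric channels and the grid resolution are assumptions made for illustration; natural logarithms are used, so rates are in nats.

```python
import math

def joint(P, W):
    """Joint distribution P o W as a dict {(a, b): P(a) W(b|a)}."""
    return {(a, b): P[a] * W[a][b] for a in range(len(P)) for b in range(len(W[a]))}

def product_of_marginals(mu, nx, ny):
    """Product distribution with the same X and Y marginals as mu."""
    mx = [sum(mu[(a, b)] for b in range(ny)) for a in range(nx)]
    my = [sum(mu[(a, b)] for a in range(nx)) for b in range(ny)]
    return {(a, b): mx[a] * my[b] for a in range(nx) for b in range(ny)}

def kl(mu, nu):
    """D(mu || nu) in nats, with the convention 0 log 0 = 0."""
    return sum(p * math.log(p / nu[k]) for k, p in mu.items() if p > 0)

def mutual_information(P, W):
    mu = joint(P, W)
    return kl(mu, product_of_marginals(mu, len(P), len(W[0])))

def bsc(p):
    return [[1 - p, p], [p, 1 - p]]

def worst_case_information(P, S):
    """Minimum of I(P, W) over a finite set S of channels."""
    return min(mutual_information(P, W) for W in S)

# Hypothetical finite compound set; grid search over binary input distributions.
S = [bsc(0.1), bsc(0.2)]
grid = [i / 1000 for i in range(1, 1000)]
best_q = max(grid, key=lambda q: worst_case_information([q, 1 - q], S))
C = worst_case_information([best_q, 1 - best_q], S)
```

For this toy set the search returns the uniform input, and the worst channel is the noisier of the two, as one would expect.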

The decoders we consider have the following form. Upon
receiving $y$, the decoder computes, for each codeword $x_m$, a score
$d^n(x_m, y)$, and decodes to the message corresponding to the
highest score. Here, $d^n : \mathcal{X}^n \times \mathcal{Y}^n \to \mathbb{R}$ is also called a
decoding metric. Note the restriction here is that the score
for a codeword does not depend on other codewords. Such
decoders are called $\alpha$-decoders in [5]. As an example, the
maximum mutual information (MMI) decoder has a score
defined as

$$d^n_{\mathrm{MMI}}(x_m, y) = I(\hat{P}_{(x_m, y)}),$$

where $\hat{P}$ denotes the empirical distribution. To be specific,
$\forall a \in \mathcal{X}, b \in \mathcal{Y}$,

$$\hat{P}_{(x_m, y)}(a, b) = \frac{1}{n} \left| \{ i : (x_m(i), y(i)) = (a, b) \} \right|,$$

and $I(\mu)$ denotes the mutual information, as a function of the
joint distribution $\mu$ on $\mathcal{X} \times \mathcal{Y}$.

It is well known that the MMI decoder is universal; when
used with the optimal code, it achieves the compound channel
capacity on any compound set. In fact, there are other
advantages of the MMI decoder: it does not require the
knowledge of $S$, and it universally achieves the random coding
error exponent [4]. Despite these advantages, the practical
difficulties of implementing an MMI decoder prevent it from
becoming a real "universally used" decoder. As empirical
distributions are used in computing the scores, it is difficult to
store and update the scores, even when a structured codebook
is used. The main goal of the current paper is to find linear
decoders that can, like the MMI decoder, be capacity achieving
on compound channels.

Definition 1: Linear Decoder
We refer to a map

$$d : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$$

as a single-letter metric. A linear decoder induced by $d$ is
defined by the decoding map:

$$G_n(y) = \arg\max_m d^n(x_m, y),$$

where

$$d^n(x_m, y) = \frac{1}{n} \sum_{i=1}^{n} d(x_m(i), y(i)) = E_{\hat{P}_{(x_m, y)}}[d].$$

Note that the reason why such decoders are called linear
decoders ($d$-decoders in [5]) is to underline the fact that the
decoding metric is additive, i.e. is a linear function of the
empirical distribution $\hat{P}_{(x_m, y)}$. The decoding metric for any
$n$ of a linear decoder is naturally defined by the single-letter
metric $d$ through the additive structure.

The advantages of using linear decoders have been discussed
thoroughly in [5], [11], [8], and also briefly in the
introduction. In short, when used with structured codes, one
can replace the log likelihood metric in a conventional decoder
by a well designed single-letter metric. This way, with few
changes in the decoder design, one can have a decoder for
the compound channel with much less complexity.

Unfortunately, there are some examples for which no linear
decoder can achieve the compound capacity. The most well-known
example is the compound set with two binary symmetric
channels, with crossover probabilities of 1/4 and 3/4,
respectively. To address the decoding challenge of these cases,
we will need a slightly more general version of linear decoders.

Definition 2: Generalized Linear Decoder
Let $d_1, d_2, \ldots, d_K$ be $K$ single-letter metrics, where $K$ is a
finite number. A generalized linear decoder induced by these
metrics is defined by the decoding map:

$$G_n(y) = \arg\max_m \vee_{k=1}^{K} \frac{1}{n} \sum_{i=1}^{n} d_k(x_m(i), y(i)) = \arg\max_m \vee_{k=1}^{K} E_{\hat{P}_{(x_m, y)}}[d_k].$$

Note that $\vee$ denotes the maximum, and it is crucial that $K$ is
a finite number, which does not depend on the code length $n$.

As an example, the maximum likelihood decoder of a given
channel $W$ is a linear decoder induced by

$$d_{ML}(a, b) = \log W(b | a), \quad \forall a \in \mathcal{X}, b \in \mathcal{Y}.$$

It is well known that for a given channel $W$, the ML
decoder, used with random codes from the optimal input
distribution, is capacity achieving. If the channel knowledge is
imperfect, for example, the decoder uses the ML rule for channel
$W_1$ while the actual channel is $W_0$, the mismatch in the
decoding metric causes the achievable data rate to decrease.
This effect is studied in [5], [11]; the result is quoted in the
following lemma. For convenience, we also include a brief
sketch of the proof.

Lemma 2: [5], [11] For a DMC $W_0$, using a random
codebook with input distribution $P_X$, if the decoder is linear
and induced by $d$, the following data rate can be achieved

$$R(P_X, W_0, d) = \inf_{\mu \in A} D(\mu \| \mu_0^p), \qquad (3)$$

where $\mu_0 = P_X \circ W_0$, $\mu_0^p$ is the product distribution
with the same $X$ and $Y$ marginal distributions as $\mu_0$, and the
optimization is over the following set of joint distributions on
$\mathcal{X} \times \mathcal{Y}$,

$$A = \{\mu : \mu_X = P_X, \ \mu_Y = (\mu_0)_Y, \ E_\mu[d] \geq E_{\mu_0}[d]\}. \qquad (4)$$

As discussed in [11], this expression, even for the optimal $P_X$,
does not in general give the highest achievable rate under the
mismatched scenario. If the input alphabet is binary, it does
so; otherwise it only gives the highest rate that can be achieved
for codes that are drawn from a random ensemble.
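For binary input and output alphabets, the optimization in Lemma 2 can be evaluated directly, since fixing both marginals of a $2 \times 2$ joint distribution leaves a single free parameter $t = \mu(0, 0)$. The sketch below relies on that observation; the function names and the choice of channels (the two binary symmetric channels with crossover probabilities 1/4 and 3/4 mentioned before Definition 2, with a uniform input) are illustrative assumptions. It shows that the log likelihood metric of the first channel yields a zero rate on the second.

```python
import math

def kl(mu, nu):
    """D(mu || nu) in nats for dicts over the same keys (0 log 0 = 0)."""
    return sum(p * math.log(p / nu[k]) for k, p in mu.items() if p > 0)

def mismatched_rate_binary(P, W0, d):
    """Evaluate the infimum of Lemma 2 for binary X and Y.

    With both marginals fixed, the joint is determined by t = mu(0, 0);
    E_mu[d] is affine in t, and D(mu || mu_0^p) is convex in t with its
    minimum (zero) at the product distribution.
    """
    px0 = P[0]
    t0 = P[0] * W0[0][0]                   # mu_0(0, 0)
    qy0 = t0 + P[1] * W0[1][0]             # (mu_0)_Y(0)
    lo, hi = max(0.0, px0 + qy0 - 1.0), min(px0, qy0)
    slope = d[0][0] - d[0][1] - d[1][0] + d[1][1]
    if slope > 0:                          # metric constraint reads t >= t0
        lo = max(lo, t0)
    elif slope < 0:                        # metric constraint reads t <= t0
        hi = min(hi, t0)
    t_prod = px0 * qy0
    t = min(max(t_prod, lo), hi)           # feasible point closest to t_prod
    mu = {(0, 0): t, (0, 1): px0 - t, (1, 0): qy0 - t, (1, 1): 1 - px0 - qy0 + t}
    prod = {(a, b): (px0 if a == 0 else 1 - px0) * (qy0 if b == 0 else 1 - qy0)
            for a in (0, 1) for b in (0, 1)}
    return kl(mu, prod)

def bsc(p):
    return [[1 - p, p], [p, 1 - p]]

def log_metric(W):
    return [[math.log(w) for w in row] for row in W]

P = [0.5, 0.5]
d = log_metric(bsc(0.25))                  # ML metric of BSC(1/4)
rate_matched = mismatched_rate_binary(P, bsc(0.25), d)
rate_flipped = mismatched_rate_binary(P, bsc(0.75), d)
```

In the matched case the constraint is active at $\mu_0$ itself and the rate equals $I(P_X, W_0)$; in the flipped case the product distribution is feasible and the rate is zero. This only illustrates one particular metric and does not by itself establish that no linear decoder works for this set.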

Proof: This is a simple application of large deviations.
By a typicality argument, the transmitted codeword, say, $x_1$,
and the received word $y$ have a joint empirical distribution close
to $\mu_0$, and thus have a score

$$d^n(x_1, y) \geq E_{\mu_0}[d] - \delta \triangleq \gamma$$

for an arbitrarily small $\delta > 0$ with a high probability when
$n$ is large enough. Now an error occurs only if there is an
incorrect codeword whose score is above $\gamma$. For a particular
codeword, $x_2$, this occurs with probability

$$P(d^n(x_2, y) \geq \gamma) \leq \exp\left[ -n \left( \min_{\mu : E_\mu[d] \geq \gamma} D(\mu \| \mu_0^p) - \delta \right) \right],$$

using the fact that $x_2$ is independent of $y$ with an i.i.d. $P_X$
distribution. The optimization is over the joint distributions $\mu$
with the correct $X$ and $Y$ marginal distributions. Now applying
the union bound, the probability

$$P(\exists i \neq 1 \text{ s.t. } d^n(x_i, y) \geq \gamma) \leq 2^{nR} \cdot P(d^n(x_2, y) \geq \gamma).$$

Moreover, the empirical distribution of $x_2, y$ is arbitrarily close
to $\mu_0^p$ with probability one. Hence, if $R < R(P_X, W_0, d)$ as
defined in the lemma's statement, the above probability can
be made arbitrarily small by taking $\delta$ small enough. ■

With a proof similar to that of the previous result, the following
lemma can also be proved.

Lemma 3: When the true channel is $W_0$ and a generalized
linear decoder induced by the single-letter metrics $\{d_k\}_{k=1}^{K}$ is
used, we can achieve the following rate

$$R(P_X, W_0, \{d_k\}_{k=1}^{K}) = \min_{\mu \in A} D(\mu \| \mu_0^p), \qquad (5)$$

where

$$A = \{\mu : \mu_X = P_X, \ \mu_Y = (\mu_0)_Y, \ \vee_{k=1}^{K} E_\mu[d_k] \geq \vee_{k=1}^{K} E_{\mu_0}[d_k]\}.$$

Note that $R(P_X, W_0, \{d_k\}_{k=1}^{K})$ can equivalently be expressed
as

$$R(P_X, W_0, \{d_k\}_{k=1}^{K}) = \min_{\mu \in A_1} D(\mu \| \mu_0^p) \wedge \ldots \wedge \min_{\mu \in A_K} D(\mu \| \mu_0^p), \qquad (6)$$

where

$$A_k = \{\mu : \mu_X = P_X, \ \mu_Y = (\mu_0)_Y, \ E_\mu[d_k] \geq \vee_{j=1}^{K} E_{\mu_0}[d_j]\}, \quad \forall 1 \leq k \leq K.$$

Now we are ready for the main problem studied in this
paper. For any given compound set $S$, let the compound
channel capacity be $C(S)$ and the corresponding optimal input
distribution be $P_X$. We would like to find $K$ and $d_1, \ldots, d_K$,
such that

$$R(P_X, W_0, \{d_k\}_{k=1}^{K}) \geq C(S)$$

for every $W_0 \in S$.

If this holds, the generalized decoder induced by the metrics
$\{d_k\}_{k=1}^{K}$ is capacity achieving on the compound set $S$ (i.e.,
using arguments analogous to the achievability proof of the
compound capacity in [2], there exists a codebook that makes
the overall coding scheme capacity achieving).

III. The Local Geometric Analysis



We know that the divergence is not a distance between two
distributions. However, if its two arguments are close enough,
the divergence is approximately a squared norm, namely for
any probability distribution $p$ on $\mathcal{Z}$ (where $\mathcal{Z}$ is any alphabet)
and for any $v$ s.t. $\sum_{z \in \mathcal{Z}} v(z) p(z) = 0$, we have

$$D(p(1 + \epsilon v) \| p) = \frac{\epsilon^2}{2} \sum_{z \in \mathcal{Z}} v^2(z) p(z) + o(\epsilon^2). \qquad (7)$$

This is the main tool used in this section. For convenience,
we define

$$\|v\|_p^2 = \sum_{z \in \mathcal{Z}} v^2(z) p(z),$$

which is the squared $\ell_2$-norm of $v$, with weight measure $p$.
Similarly, we can define the weighted inner product,

$$\langle u, v \rangle_p = \sum_{z \in \mathcal{Z}} u(z) v(z) p(z).$$

With these notations, one can write the approximation (7) as

$$D(p(1 + \epsilon v) \| p) = \frac{\epsilon^2}{2} \|v\|_p^2 + o(\epsilon^2).$$

Ignoring the higher order term, the above approximation can 
greatly simplify many optimization problems involving K-L 
divergences. In information theoretic problems dealing with 
discrete channels, such approximation is tight for some special 
cases such as when the channel is very noisy. 
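The quality of the local approximation (7) can be checked numerically. The following sketch uses a hypothetical three-letter distribution $p$ and direction $v$ (chosen so that $\sum_z v(z) p(z) = 0$) and compares the exact divergence, in nats, with $\frac{\epsilon^2}{2} \|v\|_p^2$ for decreasing $\epsilon$.

```python
import math

def kl(p, q):
    """D(p || q) in nats for two lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_sq_norm(v, p):
    """Squared l2-norm of v with weight measure p."""
    return sum(vi * vi * pi for vi, pi in zip(v, p))

# Hypothetical example: v has zero mean under p (0.5 - 0.3 - 0.2 = 0).
p = [0.5, 0.3, 0.2]
v = [1.0, -1.0, -1.0]
mean_v = sum(vi * pi for vi, pi in zip(v, p))

ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    perturbed = [pi * (1 + eps * vi) for pi, vi in zip(p, v)]
    exact = kl(perturbed, p)
    approx = 0.5 * eps ** 2 * weighted_sq_norm(v, p)
    ratios.append(exact / approx)   # tends to 1 as eps decreases
```

The ratio between the exact divergence and its quadratic approximation approaches one as $\epsilon$ shrinks, which is the sense in which the higher order term can be ignored.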
In general, a very noisy channel is one whose output
depends only weakly on the input. If the conditional probability of
observing any output does not depend on the input (i.e. the
transition probability matrix has constant columns), we have
a "pure noise" channel. So a very noisy channel should be
somehow close to such a pure noise channel. Formally, we
consider the following family of channels:

$$W_\epsilon(b | a) = P_N(b)(1 + \epsilon L(a, b)),$$

where $L$ satisfies for any $a \in \mathcal{X}$

$$\sum_{b \in \mathcal{Y}} L(a, b) P_N(b) = 0. \qquad (8)$$

We say that $W_\epsilon$ is a very noisy channel if $\epsilon \ll 1$. In this
case, the conditional distribution of the output, conditioned on
any input symbol, is close to a distribution $P_N$ (on $\mathcal{Y}$), which
can be thought of as the distribution of pure noise. Each of these
channels can be viewed as a perturbation from a pure
noise channel $P_N$, along the direction specified by $L(\cdot, \cdot)$.

This way of defining very noisy channels can be found
in [9], [7]. In fact, there are many other possible ways to
describe a perturbation of a distribution. For example, readers
familiar with [1] might feel it natural to perturb distributions
along exponential families. Since we are interested only in
small perturbations, it is not hard to verify that these different
definitions are indeed equivalent.



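When an input distribution $P_X$ is chosen, the corresponding
output distribution, over the very noisy channel, can be written
as, $\forall b \in \mathcal{Y}$,

$$\sum_{a} P_X(a) W_\epsilon(b | a) = P_N(b) \left( 1 + \epsilon \sum_{a} P_X(a) L(a, b) \right) = P_N(b)(1 + \epsilon \bar{L}(b)),$$

where $\bar{L}(b) = \sum_a P_X(a) L(a, b)$, $\forall b \in \mathcal{Y}$.
Hence, a codeword which is sent and the received output have
components which are i.i.d. from the following distribution

$$P_X \circ W_\epsilon = P_X P_N (1 + \epsilon L),$$

and similarly, a codeword which is not sent and the received
output have components which are i.i.d. from the following
distribution

$$(P_X \circ W_\epsilon)^p = P_X P_N (1 + \epsilon \bar{L}).$$

Therefore, the mutual information for very noisy channels is
given by

$$I(P_X, W_\epsilon) = D(P_X P_N (1 + \epsilon L) \| P_X P_N (1 + \epsilon \bar{L})) = \frac{\epsilon^2}{2} \|\tilde{L}\|^2 + o(\epsilon^2),$$

where

$$\|\tilde{L}\|^2 = \sum_{a \in \mathcal{X}, b \in \mathcal{Y}} \tilde{L}^2(a, b) P_X(a) P_N(b)$$

and

$$\tilde{L}(a, b) \triangleq L(a, b) - \bar{L}(b),$$

which we call the centered directions.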
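The very noisy expression for the mutual information can be checked on a small example. In the sketch below, the input distribution, the noise distribution $P_N$ and the direction $L$ are hypothetical values chosen so that each row of $L$ has zero mean under $P_N$, as required by (8); the exact mutual information (in nats) is compared with $\frac{\epsilon^2}{2} \|\tilde{L}\|^2$.

```python
import math

def very_noisy_channel(PN, L, eps):
    """Transition matrix W_eps(b|a) = PN(b) (1 + eps L(a, b))."""
    return [[PN[b] * (1 + eps * L[a][b]) for b in range(len(PN))]
            for a in range(len(L))]

def mutual_information(PX, W):
    """I(PX, W) in nats."""
    ny = len(W[0])
    out = [sum(PX[a] * W[a][b] for a in range(len(PX))) for b in range(ny)]
    return sum(PX[a] * W[a][b] * math.log(W[a][b] / out[b])
               for a in range(len(PX)) for b in range(ny) if W[a][b] > 0)

def centered_sq_norm(PX, PN, L):
    """Squared norm of L - Lbar, weighted by PX x PN."""
    Lbar = [sum(PX[a] * L[a][b] for a in range(len(PX))) for b in range(len(PN))]
    return sum(PX[a] * PN[b] * (L[a][b] - Lbar[b]) ** 2
               for a in range(len(PX)) for b in range(len(PN)))

# Hypothetical example; each row of L has zero mean under PN.
PX = [0.4, 0.6]
PN = [0.5, 0.3, 0.2]
L = [[1.0, -1.0, -1.0],
     [-0.4, 0.0, 1.0]]

norm_sq = centered_sq_norm(PX, PN, L)
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    exact = mutual_information(PX, very_noisy_channel(PN, L, eps))
    ratios.append(exact / (0.5 * eps ** 2 * norm_sq))   # tends to 1
```

Only the centered part of the direction contributes: adding to $L$ any function of $b$ alone that keeps (8) satisfied would change the output distribution but not the leading term of the mutual information.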

A. Very Noisy with Mismatched Decoder

As stated in Lemma 2, for an input distribution $P_X$, a
mismatched linear decoder induced by the metric $d$, when the
true channel is $W_0$, can achieve the following rate

$$\inf_{\mu \in A} D(\mu \| \mu_0^p),$$

where

$$A = \{\mu : \mu_X = P_X, \ \mu_Y = (\mu_0)_Y, \ E_\mu[d] \geq E_{\mu_0}[d]\}.$$

Now, if the channels are very noisy, this achievable rate can
be expressed in the following simple form.

Proposition 1: Let $W_{0,\epsilon} = P_N(1 + \epsilon L_0)$ and $d_\epsilon = \log W_{1,\epsilon}$,
where $W_{1,\epsilon} = P_N(1 + \epsilon L_1)$. For a given input
distribution $P_X$, we can achieve the following rate

$$\lim_{\epsilon \to 0} \frac{2}{\epsilon^2} R(P_X, W_{0,\epsilon}, d_\epsilon) = \begin{cases} \dfrac{\langle \tilde{L}_0, \tilde{L}_1 \rangle^2}{\|\tilde{L}_1\|^2}, & \text{when } \langle \tilde{L}_0, \tilde{L}_1 \rangle \geq 0, \\ 0, & \text{otherwise.} \end{cases}$$

Note that it is w.l.o.g. to consider the single-letter metric to
be the log of a channel; however, we do restrict all channels
to be around a common $P_N$ distribution.



The previous result says that the mismatched mutual information
obtained when decoding with the linear decoder induced by
the mismatched metric $\log W_{1,\epsilon}$, whereas the true channel
is $W_{0,\epsilon}$, is approximately the squared norm of the projection of
the true channel centered direction $\tilde{L}_0$ onto the mismatched
centered direction $\tilde{L}_1$. This result gives an intuitive picture
of the mismatched mutual information: as expected, if the
decoder is matched, i.e. $\tilde{L}_0 = \tilde{L}_1$, the squared norm of the
projection is $\|\tilde{L}_0\|^2$, which is the very noisy mutual information of
$L_0$; and the more orthogonal $\tilde{L}_1$ is to $\tilde{L}_0$, the more mismatched
the decoder is, with a lower achievable rate (eventually 0).
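The limiting expression of Proposition 1 is straightforward to evaluate. The sketch below computes the centered directions and the squared norm of the projection for hypothetical $P_X$, $P_N$ and directions (each row of every direction has zero mean under $P_N$); the function names are illustrative, and returning zero when $\tilde{L}_1 = 0$ is an added convention of this sketch for the degenerate case.

```python
def center(PX, L):
    """Centered direction: L(a, b) minus its PX-average over a."""
    nb = len(L[0])
    Lbar = [sum(PX[a] * L[a][b] for a in range(len(PX))) for b in range(nb)]
    return [[L[a][b] - Lbar[b] for b in range(nb)] for a in range(len(PX))]

def inner(PX, PN, U, V):
    """Inner product weighted by PX x PN."""
    return sum(PX[a] * PN[b] * U[a][b] * V[a][b]
               for a in range(len(PX)) for b in range(len(PN)))

def vn_mismatched_rate(PX, PN, L0, L1):
    """Squared norm of the projection of centered L0 onto centered L1,
    or zero when the inner product is not positive."""
    t0, t1 = center(PX, L0), center(PX, L1)
    ip = inner(PX, PN, t0, t1)
    n1 = inner(PX, PN, t1, t1)
    if ip <= 0 or n1 == 0:      # n1 == 0: convention chosen for this sketch
        return 0.0
    return ip * ip / n1

# Hypothetical example; rows of L0 and L1 have zero mean under PN.
PX = [0.4, 0.6]
PN = [0.5, 0.3, 0.2]
L0 = [[1.0, -1.0, -1.0], [-0.4, 0.0, 1.0]]
L1 = [[0.6, -1.0, 0.0], [1.0, -1.0, -1.0]]
neg_L0 = [[-x for x in row] for row in L0]

rate_matched = vn_mismatched_rate(PX, PN, L0, L0)      # squared norm of centered L0
rate_other = vn_mismatched_rate(PX, PN, L0, L1)
rate_opposite = vn_mismatched_rate(PX, PN, L0, neg_L0)
```

By the Cauchy-Schwarz inequality the mismatched value never exceeds the matched one, and a metric pointing in the opposite direction gives zero.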
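Proof: For each $\epsilon$, the minimizer can be expressed
as

$$\mu_\epsilon = P_X P_N (1 + \epsilon L),$$

where $L$ is a function on $\mathcal{X} \times \mathcal{Y}$, satisfying

$$\sum_{a \in \mathcal{X}, b \in \mathcal{Y}} P_X(a) P_N(b) L(a, b) = 0$$

and the two marginal constraints, respectively

$$(\mu_\epsilon)_X = P_X \iff \sum_{b \in \mathcal{Y}} P_N(b) L(a, b) = 0, \ \forall a \in \mathcal{X}, \qquad (9)$$

$$(\mu_\epsilon)_Y = (\mu_{0,\epsilon})_Y \iff \sum_{a \in \mathcal{X}} P_X(a) L(a, b) = \sum_{a \in \mathcal{X}} P_X(a) L_0(a, b), \ \forall b \in \mathcal{Y}. \qquad (10)$$

Now the constraint $E_{\mu_\epsilon}[\log W_{1,\epsilon}] \geq E_{\mu_{0,\epsilon}}[\log W_{1,\epsilon}]$ can be
written as

$$\sum_{a \in \mathcal{X}, b \in \mathcal{Y}} P_X(a) P_N(b)(1 + \epsilon L(a, b)) \cdot [\log P_N(b) + \log(1 + \epsilon L_1(a, b))]$$
$$\geq \sum_{a \in \mathcal{X}, b \in \mathcal{Y}} P_X(a) P_N(b)(1 + \epsilon L_0(a, b)) \cdot [\log P_N(b) + \log(1 + \epsilon L_1(a, b))].$$

Using a first order Taylor expansion for the two log terms, and
the marginal constraint (10), we have that the previous constraint
is equivalent to

$$\langle L, L_1 \rangle \geq \langle L_0, L_1 \rangle, \qquad (11)$$

where

$$\langle L, L_1 \rangle = \sum_{a \in \mathcal{X}, b \in \mathcal{Y}} P_X(a) P_N(b) L(a, b) L_1(a, b).$$

Finally, we can write the objective function as

$$D(\mu_\epsilon \| \mu_{0,\epsilon}^p) = D(P_X P_N (1 + \epsilon L) \| P_X P_N (1 + \epsilon \bar{L}_0)) = \frac{\epsilon^2}{2} \|L - \bar{L}_0\|^2_{P_X \times P_N} + o(\epsilon^2). \qquad (12)$$

So we have transformed the original optimization problem
into the very noisy setting

$$\lim_{\epsilon \to 0} \frac{2}{\epsilon^2} \inf_{\mu \in A} D(\mu \| \mu_{0,\epsilon}^p) = \inf_{L : \langle L, L_1 \rangle \geq \langle L_0, L_1 \rangle} \|L - \bar{L}_0\|^2, \qquad (13)$$

where the optimization on the RHS is over $L$ satisfying the
marginal constraints (9) and (10).

Now this optimization can be further simplified. By noticing
that (10) implies $\bar{L} = \bar{L}_0$, we have that $L - \bar{L}_0 = L - \bar{L}$, which
we defined to be $\tilde{L}$. So $\tilde{L}$ satisfies both marginal constraints
and the constraint in (13) becomes

$$\langle L, L_1 \rangle \geq \langle L_0, L_1 \rangle \iff \langle \tilde{L}, \tilde{L}_1 \rangle \geq \langle L_0, L_1 \rangle - \langle \bar{L}_0, \bar{L}_1 \rangle = \langle \tilde{L}_0, \tilde{L}_1 \rangle.$$

That is, both the objective and the constraint functions are
now written in terms of centered directions, $\tilde{L}$. Hence, (13)
becomes

$$\inf_{\tilde{L} : \langle \tilde{L}, \tilde{L}_1 \rangle \geq \langle \tilde{L}_0, \tilde{L}_1 \rangle} \|\tilde{L}\|^2$$

and we can simply recognize that, if $\langle \tilde{L}_0, \tilde{L}_1 \rangle \geq 0$, the
minimizer of this expression is obtained by the projection of
$\tilde{L}_0$ onto $\tilde{L}_1$, with a minimum given by the squared norm of the
projection:

$$\frac{\langle \tilde{L}_0, \tilde{L}_1 \rangle^2}{\|\tilde{L}_1\|^2};$$

otherwise, if $\langle \tilde{L}_0, \tilde{L}_1 \rangle < 0$, the minimizer is $\tilde{L} = 0$, leading
to a zero rate. ■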
Remark: We have just seen two examples where, in the very
noisy limit, information theoretic quantities have a natural
geometric meaning in the previously described inner product
space. The cases treated in this section are the ones relevant for
the paper's problem; however, following similar expansions,
other information theoretic problems, in particular multi-user
ones (e.g. broadcast or interference channels), can also be
treated in this geometrical setting. To simplify the notation,
since the very noisy expressions scale with $\epsilon^2$ and have a factor
$\frac{1}{2}$ in the limit, we denote by $\xrightarrow{VN}$ the following operator:

$$T(\epsilon) \xrightarrow{VN} \lim_{\epsilon \to 0} \frac{2}{\epsilon^2} T(\epsilon).$$

We use the abbreviation VN for very noisy. Note that the
main reason why we use the VN limit in this paper is somewhat
similar to the reason why we consider infinite block length
in information theory: it gives us a simpler model to analyze
and helps us understand the more complex (not necessarily
very noisy) general model. This makes the VN limit more
than just an approximation for a specific regime of interest; it
makes it an analysis tool for our problems, by setting them in a
geometric framework where notions of distance and angle are
well defined. Moreover, as we will show in Section
V-B, in some cases, results proven in the VN limit can in fact
be "lifted" to results proven in the general cases.

IV. Linear Decoding for Compound Channel: The Very Noisy Case

In this section, we will study a special case of the compound
channel, the very noisy case. The local geometric analysis
introduced in the previous section can be immediately applied
to such problems. Throughout this process, we will develop a
few important concepts that will be used in solving the general
compound channel problems in Section V-B. In the following,
we first make our assumptions clear and introduce some
notation.

• All the channels are very noisy, with the same pure noise
distribution. That is, all considered channels are of the
form

$$W_\varepsilon(b|a) = P_N(b)\,(1 + \varepsilon L(a,b)), \quad \forall a \in \mathcal{X},\ b \in \mathcal{Y},$$

where $L$ satisfies $\sum_b P_N(b) L(a,b) = 0$, $\forall a$. The
compound set hence depends on $\varepsilon$, and is expressed as
$S_\varepsilon = \{P_N(1+\varepsilon L) \mid L \in S\}$, where $S$ is the set of all
possible directions. Hence, $S$ together with the pure noise
distribution $P_N$ completely determines the compound set
for any $\varepsilon$. We refer to $S$ as the compound set in the VN
setting. Note that $S$ being convex, resp. compact, is the
sufficient and necessary condition for $S_\varepsilon$ to be convex, resp.
compact, for all $\varepsilon$.

• $P_X$ is fixed (it is the optimal input distribution) and we
write

$$\mu_\varepsilon = P_X P_N (1+\varepsilon L), \quad L \in S,$$

as the joint distribution of the input and output over a
particular channel. For a given channel $W_\varepsilon$, the output
distribution is $P_N(1+\varepsilon \bar L)$, where

$$\bar L(b) = \sum_a L(a,b) P_X(a), \quad \forall b \in \mathcal{Y},$$

and as before, $\tilde L = L - \bar L$. We then denote $\tilde S = \{\tilde L : L \in S\}$.
Again, the convexity and compactness of $\tilde S$ are
equivalent to those of $S$. The only difference is that $S$
depends on the channels only, whereas $\tilde S$ depends on the
input distribution as well. As we fix $P_X$ in this section,
we use the conditions $L \in S$ and $\tilde L \in \tilde S$ interchangeably.
As a convention, we often give an index, $j$, to the possible
channels, and we naturally associate the channel index
(the joint distribution index) and the direction index, i.e.
$W_{j,\varepsilon} = P_N(1+\varepsilon L_j)$ and $\mu_{j,\varepsilon} = P_X P_N(1+\varepsilon L_j)$.
In particular, we reserve $W_{0,\varepsilon} = P_N(1+\varepsilon L_0)$ for the
true channel and use other indices, $L_1, L_2$, etc. for other
specific channels.

If one considers the metrics to be the log of some
channels, i.e., $d_j = \log W_{j,\varepsilon}$, then

$$d_{j,\varepsilon} = \log W_{j,\varepsilon} = \log P_N + \log(1+\varepsilon L_j).$$

In general, the single-letter decoding metric $d$ does not
have to be the log likelihood of a channel; and even if it
is, the channel $W_{j,\varepsilon}$ does not have to be in the compound
set.

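We write all inner products and norms as weighted by
$P_X \times P_N$, and omit the subscript:

$$\langle L_1, L_2\rangle = \sum_{a,b} P_X(a) P_N(b) L_1(a,b) L_2(a,b), \qquad \|L\|^2 = \langle L, L\rangle.$$

Finally,

$$\min_{W_\varepsilon \in S_\varepsilon} I(P_X, W_\varepsilon) \xrightarrow{VN} \min_{L \in S} \|\tilde L\|^2,$$

and we define

$$L_S = \arg\min_{L \in S} \|\tilde L\|^2$$

to be the worst direction, with centered direction $\tilde L_S$, and
$\|\tilde L_S\|^2$ is referred to as the
very noisy compound channel capacity (on $S$).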

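We conclude this section with the following lemma, which
will be used frequently in the sequel.

Lemma 4: Let $L_i, L_j, L_k$ and $L_l$ be four directions and
assume that $\sum_a P_X(a) L_i(a,b) = \sum_a P_X(a) L_k(a,b)$ for all $b$,
i.e., $\bar L_i = \bar L_k$. We then have

$$E_{\mu_{i,\varepsilon}} \log W_{j,\varepsilon} \ge E_{\mu_{k,\varepsilon}} \log W_{l,\varepsilon}
\ \xrightarrow{VN}\ \langle L_i, L_j\rangle - \tfrac{1}{2}\|L_j\|^2 \ge \langle L_k, L_l\rangle - \tfrac{1}{2}\|L_l\|^2.$$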

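Proof: Using a second order Taylor expansion for $\log(1+\varepsilon L_j)$,
we have

$$\begin{aligned}
E_{\mu_{i,\varepsilon}} \log W_{j,\varepsilon}
&= \sum P_X P_N (1+\varepsilon L_i) \log\big(P_N(1+\varepsilon L_j)\big) \\
&= \sum P_X P_N \log P_N + \varepsilon \sum P_X P_N L_i \log P_N + \varepsilon \sum P_X P_N L_j \\
&\quad + \varepsilon^2 \sum P_X P_N L_i L_j - \frac{\varepsilon^2}{2} \sum P_X P_N L_j^2 + o(\varepsilon^2).
\end{aligned} \tag{14}$$

The only term which is zero in the previous summation is the third
term, namely $\sum P_X P_N L_j = 0$, which is a consequence of the
fact that $L_j$ is a direction (i.e. $\sum_b P_N L_j = 0$). Now, when we
look at the inequality $E_{\mu_{i,\varepsilon}} \log W_{j,\varepsilon} \ge E_{\mu_{k,\varepsilon}} \log W_{l,\varepsilon}$,
we can simplify the term $\sum P_X P_N \log P_N$, since it appears
on both the left and right hand sides. Moreover, using the
assumption that $\sum_a P_X(a) L_i(a,b) = \sum_a P_X(a) L_k(a,b)$, we have
$\sum P_X P_N L_i \log P_N = \sum P_X P_N L_k \log P_N$. Hence the only
terms that survive in (14), when comparing $E_{\mu_{i,\varepsilon}} \log W_{j,\varepsilon}$
with $E_{\mu_{k,\varepsilon}} \log W_{l,\varepsilon}$, are the terms in $\varepsilon^2$, which proves the lemma.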
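The expansion used in this proof can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: the alphabet sizes, the distributions `p_x` and `p_n`, and the helper names are illustrative assumptions. It compares the exact value of $E_{\mu_{i,\varepsilon}} \log W_{j,\varepsilon}$ with the expansion truncated after the $\varepsilon^2$ terms, and the gap should shrink roughly like $\varepsilon^3$.

```python
import math
import random

def random_direction(rng, nx, p_n):
    """Random L(a, b) with sum_b P_N(b) L(a, b) = 0 for every input a."""
    rows = []
    for _ in range(nx):
        raw = [rng.uniform(-1.0, 1.0) for _ in p_n]
        mean = sum(p * v for p, v in zip(p_n, raw))
        rows.append([v - mean for v in raw])
    return rows

def inner(p_x, p_n, l_a, l_b):
    """Inner product weighted by P_X x P_N."""
    return sum(p_x[a] * p_n[b] * l_a[a][b] * l_b[a][b]
               for a in range(len(p_x)) for b in range(len(p_n)))

def exact_expectation(p_x, p_n, l_i, l_j, eps):
    """E_{mu_i} log W_j for mu_i = P_X P_N (1 + eps L_i), W_j = P_N (1 + eps L_j)."""
    return sum(p_x[a] * p_n[b] * (1 + eps * l_i[a][b])
               * math.log(p_n[b] * (1 + eps * l_j[a][b]))
               for a in range(len(p_x)) for b in range(len(p_n)))

def second_order(p_x, p_n, l_i, l_j, eps):
    """Expansion truncated after eps^2 (the eps * sum P_X P_N L_j term vanishes)."""
    log_pn = [[math.log(p) for p in p_n] for _ in p_x]
    ones = [[1.0 for _ in p_n] for _ in p_x]
    zeroth = inner(p_x, p_n, ones, log_pn)
    first = eps * inner(p_x, p_n, l_i, log_pn)
    second = eps ** 2 * (inner(p_x, p_n, l_i, l_j) - 0.5 * inner(p_x, p_n, l_j, l_j))
    return zeroth + first + second

# Illustrative (assumed) distributions and random directions.
rng = random.Random(0)
p_x, p_n = [0.3, 0.7], [0.2, 0.5, 0.3]
l_i = random_direction(rng, len(p_x), p_n)
l_j = random_direction(rng, len(p_x), p_n)

errors = {eps: abs(exact_expectation(p_x, p_n, l_i, l_j, eps)
                   - second_order(p_x, p_n, l_i, l_j, eps))
          for eps in (1e-1, 1e-2, 1e-3)}
for eps, err in errors.items():
    print(f"eps={eps:g}  |exact - expansion| = {err:.3e}")
```

The $\varepsilon^2$ coefficient in `second_order` is $\langle L_i, L_j\rangle - \frac{1}{2}\|L_j\|^2$, which is the quantity compared on both sides of the lemma.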



A. One-sided Sets 

We consider for now the use of a linear decoder (i.e., induced
by only one metric). We recall that, as proved in the previous
section, for $W_{0,\varepsilon} = P_N(1+\varepsilon L_0)$ and $d_\varepsilon = \log W_{1,\varepsilon}$, where
$W_{1,\varepsilon} = P_N(1+\varepsilon L_1)$, we have

$$\lim_{\varepsilon \to 0} \frac{2}{\varepsilon^2} R(P_X, W_{0,\varepsilon}, d_\varepsilon) =
\begin{cases}
\dfrac{\langle \tilde L_0, \tilde L_1\rangle^2}{\|\tilde L_1\|^2} & \text{when } \langle \tilde L_0, \tilde L_1\rangle \ge 0, \\
0 & \text{otherwise.}
\end{cases}$$

This picture of the mismatched mutual information directly
suggests a first result. Assume $S$, hence $\tilde S$, to be convex. By
using the worst channel as the only decoding metric, it is
then clear that the VN compound capacity can be achieved.
In fact, no matter what the true channel $L_0 \in S$ is, the
mismatched mutual information, given by the squared norm of
the projection of $\tilde L_0$ onto $\tilde L_S$, cannot be smaller than $\|\tilde L_S\|^2$,
which is the very noisy compound capacity of $S$ (cf. Figure
1). This agrees with a result proved in [5].

However, with this picture we understand that the notion of
convexity is not necessary. As long as the compound set is
such that its projection in the direction of the minimal vector
stays on one side, i.e., if the compound set is entirely contained
in the half space delimited by the normal plane to the minimal
vector, i.e., if for any $L_0 \in S$, we have $\langle \tilde L_0, \tilde L_S\rangle \ge 0$ and

$$\frac{\langle \tilde L_0, \tilde L_S\rangle}{\|\tilde L_S\|} \ge \|\tilde L_S\|,$$

we will achieve compound capacity by using the linear decoder
induced by the worst channel metric (cf. Figure 1, where $S$ is
not convex but still verifies the above conditions). We call such
sets one-sided sets, as defined in the following.

Fig. 1. Very noisy one-sided compound set: in this figure, $S$ is the union
of three sets. The linear decoder induced by the worst channel metric $\log W_{S,\varepsilon}$
when the true channel is $L_0$ affords reliable communication for rates as large
as the squared norm of the projection of $\tilde L_0$ onto $\tilde L_S$. From the one-sided
shape of the compound set, this projection's squared norm is always as large
as the compound capacity given by the squared norm of $\tilde L_S$.

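Definition 3: VN One-sided Set
A VN compound set $S$ is one-sided iff for any $L_0 \in S$, we
have

$$\langle \tilde L_0, \tilde L_S\rangle \ge 0, \tag{15}$$

$$\frac{\langle \tilde L_0, \tilde L_S\rangle}{\|\tilde L_S\|} \ge \|\tilde L_S\|. \tag{16}$$

Equivalently, a VN compound set $S$ is one-sided iff for any
$L_0 \in S$, we have

$$\|\tilde L_0\|^2 - \|\tilde L_S\|^2 - \|\tilde L_0 - \tilde L_S\|^2 \ge 0. \tag{17}$$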
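The equivalence between the two forms of the definition follows from expanding the squared distance. The short derivation below is a sketch of that step, assuming $\|\tilde L_S\| > 0$ so that division by the norm is allowed.

```latex
\begin{align*}
\|\tilde L_0\|^2 - \|\tilde L_S\|^2 - \|\tilde L_0 - \tilde L_S\|^2
  &= \|\tilde L_0\|^2 - \|\tilde L_S\|^2
     - \big(\|\tilde L_0\|^2 - 2\langle \tilde L_0, \tilde L_S\rangle + \|\tilde L_S\|^2\big) \\
  &= 2\big(\langle \tilde L_0, \tilde L_S\rangle - \|\tilde L_S\|^2\big).
\end{align*}
% Hence the norm condition holds iff <L0~, LS~> >= ||LS~||^2.
% This forces <L0~, LS~> >= 0, and dividing by ||LS~|| > 0 gives
% <L0~, LS~> / ||LS~|| >= ||LS~||; the converse is immediate.
```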



Proposition 2: In the VN setting, the linear decoder induced
by the worst channel metric $\log W_{S,\varepsilon}$, where
$W_{S,\varepsilon} = P_N(1+\varepsilon L_S)$, is capacity achieving
for one-sided sets.

The very noisy picture also suggests that the one-sided property
is indeed necessary in order to achieve the
compound capacity with a single linear decoder. However, our
main goal here is not motivated by results of this kind and
we will not discuss this in more detail. We now investigate
whether we can still achieve compound capacity on non one-sided
compound sets, by using generalized linear decoders.



B. Finite Sets 

Let us consider a simple case of a non one-sided set, namely
when $S$ contains only two channels that do not satisfy the
one-sided property in (17). We denote the set by

$$S = \{W_0, W_1\},$$

and it contains the true channel $W_0$ and an arbitrary other
channel $W_1$. A first idea is to use a generalized decoder
induced by the two metrics $d_1 = \log W_0$ and $d_2 = \log W_1$,
i.e. decoding with the GLRT test using both channels, which
defines the following decoding map

$$\arg\max_m\ W_0^n(y|x_m) \vee W_1^n(y|x_m).$$

The maximization of $W_0^n(y|x_m)$ corresponds to
an optimal ML decoder with the true channel,
whereas the maximization of $W_1^n(y|x_m)$ corresponds to
an ML decoder with a mismatched metric, which
may have nothing to do with the true channel metric. So we
need to estimate how probable it is that a codeword which has
not been sent appears highly plausible under the mismatched
metric (i.e., an error event). Using (6), we can achieve the
following rate with such a decoder:

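$$R_0 \wedge R_1, \tag{18}$$

where

$$R_k = \min_{\mu \in A_k} D(\mu \| \mu_0^p), \quad k = 0, 1, \tag{19}$$

and

$$A_k = \Big\{\mu : \mu_X = P_X,\ \mu_Y = (\mu_0)_Y,\ E_\mu \log W_k \ge \vee_{j=0}^{1} E_{\mu_0} \log W_j \Big\}, \quad \forall k = 0, 1.$$

Note that $\vee_{j=0}^{1} E_{\mu_0} \log W_j = E_{\mu_0} \log W_0$, hence the expression
of $A_k$ simplifies to

$$A_k = \{\mu : \mu_X = P_X,\ \mu_Y = (\mu_0)_Y,\ E_\mu \log W_k \ge E_{\mu_0} \log W_0\}, \quad \forall k = 0, 1.$$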

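Moreover, the compound capacity of $S$ is given here by

$$C(S) = I(P_X, W_0) \wedge I(P_X, W_1).$$

We know that $R_0$ is the mutual information of $W_0$, i.e. $R_0 =
I(P_X, W_0)$ (since it is the rate achieved with an ML decoder
with a metric matched to the channel, as explained previously).
So the generalized decoder that we are considering achieves
compound capacity if $R_1 \ge C(S)$. We check this here in
the very noisy setting. We use the notations and conventions
defined previously for the VN setting, and to compute the VN
limit of $R_1$ we need the VN limits of $D(\mu_\varepsilon \| \mu_{0,\varepsilon}^p)$ and $A_{1,\varepsilon}$.
We have

$$D(\mu_\varepsilon \| \mu_{0,\varepsilon}^p) \xrightarrow{VN} \|L - \bar L_0\|^2.$$

Moreover $(\mu_\varepsilon)_X = P_X$ for any $\varepsilon$, since we assume that $L$
satisfies $\sum_b L(a,b) P_N(b) = 0$, and

$$(\mu_\varepsilon)_Y = (\mu_{0,\varepsilon})_Y \iff \bar L = \bar L_0.$$

Finally, using Lemma 4 we have

$$E_{\mu_\varepsilon} \log W_{1,\varepsilon} \ge E_{\mu_{0,\varepsilon}} \log W_{0,\varepsilon}
\ \xrightarrow{VN}\ \langle L, L_1\rangle \ge \tfrac{1}{2}\big(\|L_0\|^2 + \|L_1\|^2\big).$$

Hence

$$\begin{aligned}
A_{1,\varepsilon} &\xrightarrow{VN} \big\{L : \bar L = \bar L_0,\ \langle L, L_1\rangle \ge \tfrac{1}{2}(\|L_0\|^2 + \|L_1\|^2)\big\} \\
&= \big\{L : \bar L = \bar L_0,\ \langle \tilde L, \tilde L_1\rangle \ge \tfrac{1}{2}(\|L_0\|^2 + \|L_1\|^2) - \langle \bar L_0, \bar L_1\rangle\big\}.
\end{aligned} \tag{20}$$

Note that we used $\bar L = \bar L_0$ to get (20) from its previous line.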
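Putting pieces together we get

$$R_{1,\varepsilon} \xrightarrow{VN}
\min_{L:\ \bar L = \bar L_0,\ \langle \tilde L, \tilde L_1\rangle \ge \frac{1}{2}(\|L_0\|^2 + \|L_1\|^2) - \langle \bar L_0, \bar L_1\rangle} \|L - \bar L_0\|^2
= \min_{\tilde L:\ \langle \tilde L, \tilde L_1\rangle \ge \frac{1}{2}(\|L_0\|^2 + \|L_1\|^2) - \langle \bar L_0, \bar L_1\rangle} \|\tilde L\|^2.$$

We are now able to resolve the above minimization, and we
get

$$R_{1,\varepsilon} \xrightarrow{VN} \frac{\big[\frac{1}{2}(\|L_0\|^2 + \|L_1\|^2) - \langle \bar L_0, \bar L_1\rangle\big]^2}{\|\tilde L_1\|^2}.$$

Also,

$$C(S_\varepsilon) \xrightarrow{VN} \|\tilde L_0\|^2 \wedge \|\tilde L_1\|^2.$$

Therefore, the inequality which allows us to verify locally if
the proposed decoding rule achieves compound capacity, i.e.
if $R_1 \ge C(S)$ in the VN setting, is given by

$$\frac{\big[\frac{1}{2}(\|L_0\|^2 + \|L_1\|^2) - \langle \bar L_0, \bar L_1\rangle\big]^2}{\|\tilde L_1\|^2} \ge \|\tilde L_0\|^2 \wedge \|\tilde L_1\|^2. \tag{21}$$

But

$$\tfrac{1}{2}\big(\|L_0\|^2 + \|L_1\|^2\big) - \langle \bar L_0, \bar L_1\rangle
= \tfrac{1}{2}\big(\|\tilde L_0\|^2 + \|\tilde L_1\|^2 + \|\bar L_0 - \bar L_1\|^2\big),$$

hence (21) is equivalent to

$$\frac{1}{4\|\tilde L_1\|^2}\big(\|\tilde L_0\|^2 + \|\tilde L_1\|^2 + \|\bar L_0 - \bar L_1\|^2\big)^2 \ge \|\tilde L_0\|^2 \wedge \|\tilde L_1\|^2,$$

which clearly holds no matter what $L_0$ and $L_1$ are, since
$(\|\tilde L_0\|^2 + \|\tilde L_1\|^2)^2 \ge 4\|\tilde L_0\|^2 \|\tilde L_1\|^2$.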
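The VN inequality for the two-channel GLRT can also be checked numerically on random directions. The sketch below is only an illustration under assumed values: the distributions `p_x`, `p_n`, the alphabet sizes and all helper names are hypothetical. It computes the VN rate $\big[\frac{1}{2}(\|L_0\|^2+\|L_1\|^2)-\langle \bar L_0,\bar L_1\rangle\big]^2/\|\tilde L_1\|^2$ and compares it with $\|\tilde L_0\|^2 \wedge \|\tilde L_1\|^2$.

```python
import random

def random_direction(rng, nx, p_n):
    """Random L(a, b) with sum_b P_N(b) L(a, b) = 0 for every input a."""
    rows = []
    for _ in range(nx):
        raw = [rng.uniform(-1.0, 1.0) for _ in p_n]
        mean = sum(p * v for p, v in zip(p_n, raw))
        rows.append([v - mean for v in raw])
    return rows

def inner(p_x, p_n, u, v):
    """Inner product weighted by P_X x P_N."""
    return sum(p_x[a] * p_n[b] * u[a][b] * v[a][b]
               for a in range(len(p_x)) for b in range(len(p_n)))

def split(p_x, l):
    """Return (bar, tilde): bar(b) = sum_a P_X(a) L(a, b), tilde = L - bar."""
    ny = len(l[0])
    bar_row = [sum(p_x[a] * l[a][b] for a in range(len(p_x))) for b in range(ny)]
    bar = [list(bar_row) for _ in p_x]
    tilde = [[l[a][b] - bar_row[b] for b in range(ny)] for a in range(len(p_x))]
    return bar, tilde

def glrt_rate_and_capacity(p_x, p_n, l0, l1):
    """VN limits of R_1 and of the compound capacity for S = {W_0, W_1}."""
    bar0, til0 = split(p_x, l0)
    bar1, til1 = split(p_x, l1)
    sq = lambda u: inner(p_x, p_n, u, u)
    threshold = 0.5 * (sq(l0) + sq(l1)) - inner(p_x, p_n, bar0, bar1)
    rate = threshold ** 2 / sq(til1)
    capacity = min(sq(til0), sq(til1))
    return rate, capacity

# Illustrative (assumed) distributions; 2 inputs, 3 outputs.
rng = random.Random(1)
p_x, p_n = [0.4, 0.6], [0.25, 0.25, 0.5]
violations = 0
for _ in range(2000):
    l0 = random_direction(rng, 2, p_n)
    l1 = random_direction(rng, 2, p_n)
    rate, capacity = glrt_rate_and_capacity(p_x, p_n, l0, l1)
    if rate < capacity - 1e-12:
        violations += 1
print("violations of the VN inequality:", violations)
```

A random search of this kind cannot replace the algebraic argument; it only guards against sign or index slips in the formulas.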

This can be directly generalized to any finite set, and we have
the following result.

Proposition 3: In the VN setting, GLRT with all channels
in the set is capacity achieving for finite compound sets, and
generalized linear.

C. Finite Union of One-sided Sets 

1) Using ML Metrics: In the previous sections, we have
found linear, or generalized linear, decoders that are capacity
achieving for one-sided sets and for finite sets. Next we
consider compound sets that are finite unions of one-sided sets
and hope to combine our results in these two cases. Assume

$$S = S_1 \cup S_2,$$

where $S_1$ and $S_2$ are one-sided: in this section we consider
only the VN setting, hence saying that $S_1$ is one-sided really
means that the VN compound set $S_1$ corresponding to $S_{1,\varepsilon}$ is
one-sided according to Definition 3.

Fig. 2. A VN compound set which is the union of two one-sided components,
$S_1$ and $S_2$, drawn in the space of centered directions (tilde vectors).

For a fixed input distribution $P_X$, let $W_1 = W_{S_1}$ and
$W_2 = W_{S_2}$ be the worst channels of $S_1$ and $S_2$, respectively (cf.
Figure 2). A plausible candidate for a generalized linear
universal decoder is the GLRT with metrics $d_1 = \log W_1$ and
$d_2 = \log W_2$, hoping that a combination of earlier results for
finite and one-sided sets would make this decoder capacity
achieving. Say w.l.o.g. that $W_0 \in S_1$. Using (6), the following
rate can be achieved with the proposed decoding rule:

$$R(P_X, W_0, \{d_k\}_{k=1}^{2}) = R_1 \wedge R_2,$$

where

$$R_k = \inf_{\mu \in A_k} D(\mu \| \mu_0^p), \quad k = 1, 2,$$

and for $k = 1, 2$,

$$A_k = \big\{\mu : \mu^p = \mu_0^p,\ E_\mu \log W_k \ge \vee_{j=1}^{2} E_{\mu_0} \log W_j\big\}. \tag{22}$$

Note that we are using similar notations for this section as
for the previous one, although the sets $A_k$ and rates $R_k$ are
now given by different expressions. We also use $\mu^p = \mu_0^p$ to
express in a more compact way that the marginals of $\mu$ and
$\mu_0$ are the same.

Since $W_1$ and $W_2$ are the worst channels for $P_X$ in each
component, the compound capacity over $S = S_1 \cup S_2$ is

$$C(S) = I(P_X, W_1) \wedge I(P_X, W_2).$$

In the finite compound set case of the previous section, we further
simplified the expression of the $A_k$'s, since the maximum
in $\vee_{j} E_{\mu_0} \log W_j$ could be identified. This is no longer the
case here, and we have to consider both cases, i.e.:

$$\text{Case 1: } E_{\mu_0} \log W_1 \ge E_{\mu_0} \log W_2 \tag{23}$$

$$\text{Case 2: } E_{\mu_0} \log W_1 \le E_{\mu_0} \log W_2. \tag{24}$$

In order to verify that the decoder is capacity achieving, we
need to check if both $R_1$ and $R_2$ are greater than or equal
to the compound capacity $C(S)$, no matter which of case
1 or case 2 occurs. Thus, there are four inequalities in total to
check. While checking these cases is somewhat tedious, we
will, in the following, go through each of them carefully and
point out a specific case that is problematic, before giving
a counterexample where GLRT with the worst channels is in
fact not capacity achieving. Later, when we propose a capacity
achieving decoder, we will go through a similar procedure in
a more concise way.
Note that under case 1,

$$\text{For case 1: } A_1 = \{\mu : \mu^p = \mu_0^p,\ E_\mu \log W_1 \ge E_{\mu_0} \log W_1\},$$

which has the form of the constraint set for $R(P_X, W_0, d_1)$
introduced earlier, with $d_1 = \log W_1$. Hence we have

$$\text{For case 1: } R_1 = R(P_X, W_0, \log W_1). \tag{25}$$



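As shown in Section IV-A, $R(P_X, W_0, \log W_1)$ becomes in the
VN limit:

$$R(P_X, W_{0,\varepsilon}, \log W_{1,\varepsilon}) \xrightarrow{VN} \frac{\langle \tilde L_0, \tilde L_1\rangle^2}{\|\tilde L_1\|^2} \tag{26}$$

(note that since $S_1$ is one-sided, $\langle \tilde L_0, \tilde L_1\rangle \ge 0$). Also, in the
VN limit, $C(S_\varepsilon)$ becomes $\|\tilde L_1\|^2 \wedge \|\tilde L_2\|^2$, hence

$$\text{For case 1: } R_{1,\varepsilon} \ge C(S_\varepsilon)
\ \xrightarrow{VN}\ \frac{\langle \tilde L_0, \tilde L_1\rangle^2}{\|\tilde L_1\|^2} \ge \|\tilde L_1\|^2 \wedge \|\tilde L_2\|^2. \tag{27}$$

But we assumed that $S_1$ is one-sided and that $L_1$ is the worst
direction of $S_1$. Moreover, we assumed that $W_0 \in S_1$, i.e.
$L_0 \in S_1$. Hence, (27) holds by definition of one-sided sets,
cf. Definition 3 (with this definition, (27) holds with $\|\tilde L_1\|^2$ on the
right hand side, hence it holds for $\|\tilde L_1\|^2 \wedge \|\tilde L_2\|^2$).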

For case 2, i.e. when $E_{\mu_0} \log W_1 \le E_{\mu_0} \log W_2$, we have
$R_1 = \inf_{\mu \in A_1} D(\mu \| \mu_0^p)$, where this time $A_1$ is given by

$$\text{For case 2: } A_1 = \{\mu : \mu^p = \mu_0^p,\ E_\mu \log W_1 \ge E_{\mu_0} \log W_2\}. \tag{28}$$

Note that, by definition of case 2, the constraint set $A_1$ is
smaller than the constraint set $B$ given below:

$$\begin{aligned}
A_1 &= \{\mu : \mu^p = \mu_0^p,\ E_\mu \log W_1 \ge E_{\mu_0} \log W_2\} \\
&\subseteq B = \{\mu : \mu^p = \mu_0^p,\ E_\mu \log W_1 \ge E_{\mu_0} \log W_1\},
\end{aligned} \tag{29}$$

hence,

$$\inf_{\mu \in A_1} D(\mu \| \mu_0^p) \ge \inf_{\mu \in B} D(\mu \| \mu_0^p).$$

But $B$ is the constraint set appearing in $R(P_X, W_0, \log W_1)$,
which means that

$$\inf_{\mu \in B} D(\mu \| \mu_0^p) = R(P_X, W_0, \log W_1),$$

therefore, under case 2, we showed that $R_1 \ge
R(P_X, W_0, \log W_1)$. Now, as shown before,
$R(P_X, W_0, \log W_1)$ is locally lower bounded by
$I(P_X, W_1) \ge C(S)$, by the one-sided assumption on
$S_1$.

Hence, we have just shown that $R_1 \ge C(S)$, both under
case 1 and 2.

Next, we check whether $R_2 = \inf_{\mu \in A_2} D(\mu \| \mu_0^p) \ge C(S)$
holds or not. We again have to check this for case 1 and 2.
This time we start with case 2. Note that the expression of
$R_2$ in case 2 is perfectly symmetric to the expression of $R_1$
in case 1; we just have to swap the indices 1 and 2, hence

$$\text{For case 2: } R_2 = R(P_X, W_0, \log W_2),$$

and the inequality we need to check in the very noisy case is

$$\text{For case 2: } R_{2,\varepsilon} \ge C(S_\varepsilon)
\ \xrightarrow{VN}\ \frac{\langle \tilde L_0, \tilde L_2\rangle^2}{\|\tilde L_2\|^2} \ge \|\tilde L_1\|^2 \wedge \|\tilde L_2\|^2. \tag{30}$$

However, the one-sided property does not apply anymore,
since we assumed that $L_0$ belongs to $S_1$ and not $S_2$. Indeed,
if we have no restriction on the positions of $\tilde L_0$ and $\tilde L_2$, the
left hand side of (30) can be zero. Comparing this with the case of a single
one-sided set, we see this is exactly the difficulty of analyzing
generalized linear decoders. Using multiple metrics, especially
$d_2 = \log W_2$, which does not have any one-sided relation
with the actual channel $W_0$, causes an extra chance of making
errors: an incorrect codeword can appear very plausible
according to metric $d_2$. The probability for this to happen is
captured by the rate $R_2$. On the other hand, there is also a
lower target: (30) need not hold for all possible $\tilde L_0$, $\tilde L_1$
and $\tilde L_2$; it only needs to hold when these directions
satisfy case 2. Moreover, the compound capacity is now the
minimum between the mutual informations $\|\tilde L_1\|^2$ and $\|\tilde L_2\|^2$.

One might hope that the combination of all these effects leads
to $R_2 \ge C(S)$ and hence a capacity achieving decoder design.
Unfortunately, this is not the case.

Proposition 4: In the VN setting and for compound sets
having a finite number of one-sided components, GLRT with
the worst channel of each component is not capacity achieving.

Counterexample: Let $\mathcal{X} = \mathcal{Y} = \{0,1\}$, $P_X = P_N =
\{1/2, 1/2\}$,

$$L_0 = \begin{pmatrix} -2 & 2 \\ -7 & 7 \end{pmatrix}, \quad
L_1 = \begin{pmatrix} 2 & -2 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad
L_2 = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}.$$

The achievable rate can be easily checked with this
counterexample, and in fact there are many other examples that
one can construct. In the following, we discuss the geometric
insights that lead to these counterexamples (and check that
it is indeed a counterexample). This will also be valuable in
constructing better decoders in the next section.
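The counterexample can be verified with a few lines of arithmetic. The sketch below is a minimal check, assuming the binary example with rows indexed by the input and columns by the output, $L_0 = [[-2, 2], [-7, 7]]$, $L_1 = [[2, -2], [0, 0]]$, $L_2 = [[-1, 1], [1, -1]]$ and $P_X = P_N = (1/2, 1/2)$; the helper names are illustrative. It checks that the non-centered distances from $L_0$ to $L_1$ and $L_2$ coincide, that $\|\tilde L_1\| = \|\tilde L_2\|$, that $L_0$ is one-sided with respect to $L_1$, and that the quantity $\|\tilde L_0\|^2 - \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_2\|^2$, which would need to be nonnegative for the GLRT to achieve capacity here, is negative.

```python
def inner(p_x, p_n, u, v):
    """Inner product weighted by P_X x P_N."""
    return sum(p_x[a] * p_n[b] * u[a][b] * v[a][b]
               for a in range(len(p_x)) for b in range(len(p_n)))

def sub(u, v):
    """Entrywise difference of two directions."""
    return [[x - y for x, y in zip(ru, rv)] for ru, rv in zip(u, v)]

def centered(p_x, l):
    """Tilde direction: L(a, b) minus bar L(b) = sum_a P_X(a) L(a, b)."""
    bar = [sum(p_x[a] * l[a][b] for a in range(len(p_x))) for b in range(len(l[0]))]
    return [[l[a][b] - bar[b] for b in range(len(bar))] for a in range(len(p_x))]

p_x = p_n = [0.5, 0.5]
l0 = [[-2.0, 2.0], [-7.0, 7.0]]
l1 = [[2.0, -2.0], [0.0, 0.0]]
l2 = [[-1.0, 1.0], [1.0, -1.0]]
t0, t1, t2 = (centered(p_x, l) for l in (l0, l1, l2))
sq = lambda u: inner(p_x, p_n, u, u)

dist01, dist02 = sq(sub(l0, l1)), sq(sub(l0, l2))    # non-centered squared distances
one_sided_margin = inner(p_x, p_n, t0, t1) - sq(t1)  # >= 0: L0 is one-sided w.r.t. L1
gap = sq(t0) - sq(t2) - sq(sub(t0, t2))              # capacity would require gap >= 0
print(dist01, dist02, sq(t1), sq(t2), one_sided_margin, gap)
```

With these entries the two squared distances are both 32.5, both centered norms equal 1, the one-sided margin is 1.5 and the gap is -7, so $\tilde L_0$ and $\tilde L_2$ point in opposite directions.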

We first use Lemma 4 to write

$$E_{\mu_{0,\varepsilon}} \log W_{1,\varepsilon} \le E_{\mu_{0,\varepsilon}} \log W_{2,\varepsilon}
\ \xrightarrow{VN}\ \|L_0 - L_2\| \le \|L_0 - L_1\|,$$

which can be used to rewrite (23) and (24) in the very noisy
setting as

$$\text{Case 1: } \|L_0 - L_2\| \ge \|L_0 - L_1\| \tag{31}$$

$$\text{Case 2: } \|L_0 - L_2\| \le \|L_0 - L_1\|. \tag{32}$$

Now to construct a counterexample, we consider the special
case where $\|L_0 - L_2\| = \|L_0 - L_1\|$ and $\|\tilde L_1\| = \|\tilde L_2\|$. These
assumptions are used to simplify our discussion, and are not
necessary in constructing counterexamples. One can check that
the above example satisfies both assumptions. Now (30) holds
if and only if

$$\frac{\langle \tilde L_0, \tilde L_2\rangle}{\|\tilde L_2\|} \ge \|\tilde L_2\|,$$

which is equivalent to

$$\|\tilde L_0\|^2 - \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_2\|^2 \ge 0.$$

It is easy to check that the last inequality does not hold
for the given counterexample, which completes the proof of
Proposition 4. In fact, one can write

$$\begin{aligned}
\|\tilde L_0\|^2 - \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_2\|^2
&= \|\tilde L_0\|^2 - \|\tilde L_1\|^2 - \|\tilde L_0 - \tilde L_1\|^2 \\
&\quad + \|\tilde L_0 - \tilde L_1\|^2 - \|\tilde L_0 - \tilde L_2\|^2.
\end{aligned} \tag{33}$$

The term on the second line above is always nonnegative (by the
one-sided property), but we have a problem with the term
on the last line: we assumed that $\|L_0 - L_2\| = \|L_0 - L_1\|$,
and this does not imply that $\|\tilde L_0 - \tilde L_1\| = \|\tilde L_0 - \tilde L_2\|$. The
problem here is that when using log likelihood functions as
decoding metrics, the constraints in (22), (23) and (24) are,
in the very noisy case, given in terms of the perturbation
directions $L_i$, $i = 0, 1, 2$, while the desired statement about
achievable rates and the compound capacity is given in
terms of the centered directions $\tilde L_i$. Thus, counterexamples
can be constructed by carefully assigning the $\bar L_i$'s to be different,
so that the constraints on the $L_i$'s cannot effectively regulate the
behavior of the $\tilde L_i$'s ((33) can be made negative). Figure 3 gives a
pictorial illustration of this phenomenon. The above discussion
also suggests a fix to the problem. If one could replace the
constraints on the $L_i$'s in (22), (23) and (24) by the corresponding
constraints on the $\tilde L_i$'s, that might at least allow better control
over the achievable rates. This is indeed possible by making a
small change to the decoding metrics, as done in the following
section.

2) Using MAP Metrics: We now use different metrics than
the ones used in the previous section: instead of the ML metrics
given by $\log W_k$, we use the metrics

$$\log \frac{W_k}{(\mu_k)_Y}, \qquad \mu_k = P_X \circ W_k, \tag{34}$$

which we call the MAP metrics, for maximum a posteriori,
and which may also be referred to as the Fano metrics in the
literature.



As before, let us consider $W_0$, $W_1$ and $W_2$ such that $W_1$ and
$W_2$ are the worst channels of two one-sided components $S_1$
and $S_2$, and $W_0$ belongs to $S_1$. Using (6), with $d_1 = \log \frac{W_1}{(\mu_1)_Y}$
and $d_2 = \log \frac{W_2}{(\mu_2)_Y}$, the proposed generalized linear decoder
can achieve

$$R(P_X, W_0, \{d_k\}_{k=1}^{2}) = R_1 \wedge R_2,$$

where

$$R_k = \inf_{\mu \in A_k} D(\mu \| \mu_0^p),$$

and for $k = 1, 2$,

$$A_k = \Big\{\mu : \mu^p = \mu_0^p,\ E_\mu \log \frac{W_k}{(\mu_k)_Y} \ge \vee_{j=1}^{2} E_{\mu_0} \log \frac{W_j}{(\mu_j)_Y}\Big\}.$$

Fig. 3. This figure illustrates a counterexample, for binary VN channels, to
the claim that GLRT with the worst channel metrics is capacity achieving. As
illustrated, a condition on the non-centered directions, such as $\|L_0 - L_1\| =
\|L_0 - L_2\|$, does not influence the position of the centered directions and can
allow $\tilde L_2$ and $\tilde L_0$ to be opposite, violating the desired inequality in (30).

Note that again, we use the same notations for this section as
for the previous one, although the sets $A_k$ and rates $R_k$ are
now given by different expressions. Since $W_1$ and $W_2$ are
the worst channels for $P_X$ in each component, the compound
capacity over $S = S_1 \cup S_2$ is still given by

$$C(S) = I(P_X, W_1) \wedge I(P_X, W_2).$$

As we did for (23) and (24), we consider separately two
cases:

$$\text{Case 1: } E_{\mu_0} \log \frac{W_1}{(\mu_1)_Y} \ge E_{\mu_0} \log \frac{W_2}{(\mu_2)_Y} \tag{35}$$

$$\text{Case 2: } E_{\mu_0} \log \frac{W_1}{(\mu_1)_Y} \le E_{\mu_0} \log \frac{W_2}{(\mu_2)_Y}. \tag{36}$$


Following the same argument as in the last section, we
verify that $R_1 \ge C(S)$ under both cases. Note that in case
1, the constraint in $A_1$ is

$$E_\mu \log \frac{W_1}{(\mu_1)_Y} \ge E_{\mu_0} \log \frac{W_1}{(\mu_1)_Y}.$$

Comparing this with its counterpart in ML decoding, the
only difference is the extra $E \log (\mu_1)_Y$ terms on both sides.
Noticing that $\mu$ and $\mu_0$ have the same $Y$ marginal distribution,
we see that the optimization problem is exactly the same
as before, and thus the achievable rate is the mismatched
rate $R(P_X, W_0, \log W_1)$, which by the one-sided assumption
$W_0 \in S_1$ is at least $I(P_X, W_1)$. In case 2, $R_1 \ge C(S)$
follows since (36) gives a more stringent constraint in $A_1$, and
hence a higher achievable rate (cf. (29)). Hence, just as
was the case for the ML decoding metrics, $R_1 \ge C(S)$ is
easily checked with the one-sided property. We now show that,
as opposed to the ML case, with the MAP metrics we also
have $R_2 \ge C(S)$.

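The main difference between the proposed MAP decoding
metric and the ML metric used in the previous section can
be seen clearly from the very noisy setting. Using a similar
argument as in Lemma 4 we have

$$E_{\mu_{0,\varepsilon}} \log \frac{W_{1,\varepsilon}}{(\mu_{1,\varepsilon})_Y}
\ \xrightarrow{VN}\ \langle \tilde L_0, \tilde L_1\rangle - \tfrac{1}{2}\|\tilde L_1\|^2
= \tfrac{1}{2}\big(\|\tilde L_0\|^2 - \|\tilde L_0 - \tilde L_1\|^2\big). \tag{37}$$

Thus, the optimizations in $R_k$ are over the sets

$$A_{k,\varepsilon} \xrightarrow{VN} \Big\{L : \bar L = \bar L_0,\
\langle \tilde L, \tilde L_k\rangle - \tfrac{1}{2}\|\tilde L_k\|^2 \ge \vee_{j=1}^{2} \tfrac{1}{2}\big(\|\tilde L_0\|^2 - \|\tilde L_0 - \tilde L_j\|^2\big)\Big\} \tag{38}$$

and the two cases to be considered are

$$\text{Case 1: } \|\tilde L_0 - \tilde L_1\|^2 \le \|\tilde L_0 - \tilde L_2\|^2 \tag{39}$$

$$\text{Case 2: } \|\tilde L_0 - \tilde L_1\|^2 \ge \|\tilde L_0 - \tilde L_2\|^2. \tag{40}$$

These expressions are almost the same as the ones for the
ML metric, the very noisy version of (22), (31), and (32),
except now we have the conditions on the centered directions
(tilde vectors). As discussed in the proof of Proposition 4,
this change is precisely what is needed to avoid the
counterexample. It turns out that this change is also sufficient for the
decoder to be capacity achieving.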

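Now what remains to be proved is that $R_2 \ge C(S)$. Using
(37), and noticing the marginal constraints, we have for case
1

$$R_{2,\varepsilon} \xrightarrow{VN}
\min_{L:\ \bar L = \bar L_0,\ \langle \tilde L, \tilde L_2\rangle \ge \frac{1}{2}(\|\tilde L_0\|^2 + \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_1\|^2)} \|\tilde L\|^2$$

and for case 2

$$R_{2,\varepsilon} \xrightarrow{VN}
\min_{L:\ \bar L = \bar L_0,\ \langle \tilde L, \tilde L_2\rangle \ge \frac{1}{2}(\|\tilde L_0\|^2 + \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_2\|^2)} \|\tilde L\|^2.$$

These optimizations can be explicitly solved as projections:

$$\text{For Case 1: } R_{2,\varepsilon} \xrightarrow{VN}
\frac{1}{4} \frac{\big(\|\tilde L_0\|^2 + \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_1\|^2\big)^2}{\|\tilde L_2\|^2}$$

$$\text{For Case 2: } R_{2,\varepsilon} \xrightarrow{VN}
\frac{1}{4} \frac{\big(\|\tilde L_0\|^2 + \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_2\|^2\big)^2}{\|\tilde L_2\|^2}.$$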

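Recalling that the compound capacity is given by

$$C(S_\varepsilon) \xrightarrow{VN} \|\tilde L_1\|^2 \wedge \|\tilde L_2\|^2,$$

we have

$$\text{Case 1: } R_{2,\varepsilon} \ge C(S_\varepsilon)
\ \xrightarrow{VN}\ \frac{1}{2} \frac{\|\tilde L_0\|^2 + \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_1\|^2}{\|\tilde L_2\|} \ge \|\tilde L_1\| \wedge \|\tilde L_2\| \tag{41}$$

$$\text{Case 2: } R_{2,\varepsilon} \ge C(S_\varepsilon)
\ \xrightarrow{VN}\ \frac{1}{2} \frac{\|\tilde L_0\|^2 + \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_2\|^2}{\|\tilde L_2\|} \ge \|\tilde L_1\| \wedge \|\tilde L_2\| \tag{42}$$

and we now check that inequalities (41) and (42) hold with
$\|\tilde L_1\|$ instead of $\|\tilde L_1\| \wedge \|\tilde L_2\|$ on the right hand side.
Starting with (42), we write

$$\begin{aligned}
&\frac{1}{2\|\tilde L_2\|}\big(\|\tilde L_0\|^2 + \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_2\|^2\big) - \|\tilde L_1\| \\
&\quad = \frac{1}{2\|\tilde L_2\|}\Big(\big(\|\tilde L_1\| - \|\tilde L_2\|\big)^2 + \|\tilde L_0\|^2 - \|\tilde L_1\|^2 - \|\tilde L_0 - \tilde L_2\|^2\Big) \\
&\quad \ge \frac{1}{2\|\tilde L_2\|}\Big(\big(\|\tilde L_1\| - \|\tilde L_2\|\big)^2 + \|\tilde L_0\|^2 - \|\tilde L_1\|^2 - \|\tilde L_0 - \tilde L_1\|^2\Big) \\
&\quad \ge 0,
\end{aligned}$$

where the first inequality uses (40) and the last inequality follows
from the one-sided property

$$\frac{\langle \tilde L_0, \tilde L_1\rangle}{\|\tilde L_1\|} \ge \|\tilde L_1\|
\iff \|\tilde L_0\|^2 - \|\tilde L_1\|^2 - \|\tilde L_0 - \tilde L_1\|^2 \ge 0.$$

For (41), the same expansion gets us directly to

$$\begin{aligned}
&\frac{1}{2\|\tilde L_2\|}\big(\|\tilde L_0\|^2 + \|\tilde L_2\|^2 - \|\tilde L_0 - \tilde L_1\|^2\big) - \|\tilde L_1\| \\
&\quad = \frac{1}{2\|\tilde L_2\|}\Big(\big(\|\tilde L_1\| - \|\tilde L_2\|\big)^2 + \|\tilde L_0\|^2 - \|\tilde L_1\|^2 - \|\tilde L_0 - \tilde L_1\|^2\Big) \ge 0,
\end{aligned}$$

again by the one-sided property. Now combining these results,
we get that the GMAP decoder is capacity achieving for the
VN case. The result can be easily generalized to cases with
more than two one-sided components.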
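The two-component GMAP argument can be probed numerically in the space of centered directions. The following is a sketch under stated assumptions: centered directions are represented as plain coordinate vectors with the Euclidean inner product (i.e., coordinates are assumed to be chosen so that the weighted inner product becomes a dot product), the dimension 3 and the sampling ranges are arbitrary, and only the one-sided condition of $\tilde L_0$ with respect to $\tilde L_1$ is enforced. The VN rates are obtained by solving the projection problems over the constraint sets with threshold $\frac{1}{2}\|\tilde L_k\|^2 + \max_j \frac{1}{2}(\|\tilde L_0\|^2 - \|\tilde L_0 - \tilde L_j\|^2)$.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq(u):
    return dot(u, u)

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def gmap_vn_rates(t0, t1, t2):
    """VN limits of R_1 and R_2 for the GMAP decoder (projection onto a half space)."""
    best = max(0.5 * (sq(t0) - sq(sub(t0, tj))) for tj in (t1, t2))
    rates = []
    for tk in (t1, t2):
        threshold = 0.5 * sq(tk) + best      # constraint: <L~, L~_k> >= threshold
        rates.append(threshold ** 2 / sq(tk))
    return rates

rng = random.Random(2)
checked = violations = 0
while checked < 2000:
    t0, t1, t2 = ([rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(3))
    if dot(t0, t1) < sq(t1):   # keep only samples where L0 is one-sided w.r.t. L1
        continue
    checked += 1
    capacity = min(sq(t1), sq(t2))
    if min(gmap_vn_rates(t0, t1, t2)) < capacity - 1e-12:
        violations += 1
print("samples:", checked, "violations:", violations)
```

The closed form `threshold ** 2 / sq(tk)` is valid because the threshold is nonnegative whenever the one-sided condition holds; a search of this kind is a consistency check, not a proof.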

Discussions:

The above derivations can also be viewed pictorially.
Take case 2 for $R_2$ as an example. The one-sided
constraint on $\tilde L_0$ and $\tilde L_1$ (cf. Definition 3) says that $\tilde L_0$ lies on
the right side of $\tilde L_1$; the constraint for case 2, (40), precisely implies
that $\tilde L_2$ can only lie in the smaller circle centered at $\tilde L_0$, as in
Figure 4; and the small circle intersects the large circle only in
the hatched region, where

$$\frac{\langle \tilde L_0, \tilde L_2\rangle}{\|\tilde L_2\|} \ge \|\tilde L_1\| \wedge \|\tilde L_2\| \tag{43}$$

holds. On the other hand, if we work with the ML metrics, the
constraint for case 2 is given by (32), and as we showed
in the counterexample of Section IV-C.1, this no longer
forces $\tilde L_2$ to lie inside the smaller circle centered at $\tilde L_0$, hence
inside the hatched region, as Figures 5 and 6 illustrate.

Fig. 4. Location of $\tilde L_2$ where (43) holds.

Fig. 5. Location of $\tilde L_2$ where (43) does not hold.

Fig. 6. This figure illustrates that on 3-ary input/output VN channels,
the non-centered directions (living in the 3D space) verify $\|L_0 - L_1\| =
\|L_0 - L_2\|$, but this does not influence the position of the centered directions
(in the 2D plane) and indeed $\tilde L_2$ is in the region where (43) does not hold.



It is insightful to try to understand the reason why the GMAP
decoder works well while the GLRT fails. For a linear decoder
with a single metric $d : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, if one forms a different
test by picking $d'(x,y) = d(x,y) + f(y)$, for some function
$f : \mathcal{Y} \to \mathbb{R}$, it is not hard to see that the resulting decision
is exactly the same, for every possible received signal $y$. This
is why the ML decoder and the MAP decoder, from the same
mismatched channel $W_1$, are indeed equivalent, as they differ
by a factor of $f = \log (P_X \circ W_1)_Y$. For a generalized linear
decoder with multiple metrics $d_1, d_2, \dots, d_K$, if one changes
the metrics to $d_1 + f, d_2 + f, \dots, d_K + f$, for the same function
$f$ on $\mathcal{Y}$, again the resulting decoder is the same. Things are
different, however, if one changes these metrics by different
functions, to have $d_1 + f_1, \dots, d_K + f_K$. The problem is that
this changes the balance between the metrics, which, as we
observed in the GMAP story, is critical for the generalized
linear decoder to work properly. For example, if one adds a
big number to one of the metrics to make it always dominate
the others, the purpose of using multiple metrics is defeated.
GLRT differs from the GMAP decoder by factors of $\log (\mu_k)_Y$
on the $k$-th metric, which causes a bias depending on the received
signal $y$. The counterexample we presented in the previous
section is in essence constructed to illustrate the effect of such
bias. Through a similar approach, one can indeed show that the
GMAP receiver is the unique generalized linear receiver, based
on the worst channels of different one-sided components, in
the sense that any non-trivial variation of these metrics, i.e.,
$f_1, \dots, f_K$ which are not the same function, would result in
a receiver that does not achieve the compound capacity in all
cases. Counterexamples can always be constructed in a similar
fashion.
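The effect of shifting metrics by functions of $y$ can be seen on a toy example. The sketch below uses hypothetical integer-valued metrics `d1`, `d2`, a two-codeword codebook of length one, and a hypothetical shift `f_common`; none of these come from the paper. It implements the generalized linear rule (maximize over codewords the maximum over metrics of the summed single-letter scores) and compares the decision under a common shift and under a shift applied to one metric only.

```python
def decode(codebook, y, metrics):
    """Generalized linear decoder: argmax over codewords of max_k sum_i d_k(x_i, y_i)."""
    def score(x):
        return max(sum(d[(xi, yi)] for xi, yi in zip(x, y)) for d in metrics)
    return max(range(len(codebook)), key=lambda m: score(codebook[m]))

def shift(metric, f):
    """Return the shifted metric d'(x, y) = d(x, y) + f(y)."""
    return {(x, y): v + f[y] for (x, y), v in metric.items()}

# Hypothetical single-letter metrics on binary alphabets, keyed by (x, y).
d1 = {(0, 0): 3, (1, 0): 0, (0, 1): 0, (1, 1): 3}
d2 = {(0, 0): 0, (1, 0): 2, (0, 1): 2, (1, 1): 0}
codebook = [(0,), (1,)]
y = (0,)

f_common = {0: 5, 1: -1}   # a function of the output only
f_zero = {0: 0, 1: 0}

base = decode(codebook, y, [d1, d2])
same_shift = decode(codebook, y, [shift(d1, f_common), shift(d2, f_common)])
uneven_shift = decode(codebook, y, [shift(d1, f_zero), shift(d2, f_common)])
print(base, same_shift, uneven_shift)
```

Here the common shift leaves the decision unchanged, while shifting only the second metric lets it dominate and flips the decision, which is the bias mechanism described above.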



Definition 4: One-sided Set

A set $S$ is one-sided, if

$$D(\mu_0 \| \mu_S^p) \ge D(\mu_0 \| \mu_S) + D(\mu_S \| \mu_S^p), \quad \forall W_0 \in S, \tag{44}$$

where

$$W_S = \arg\min_{W \in \mathrm{cl}(S)} I(P_X, W) \tag{45}$$

and $\mu_0 = P_X \circ W_0$, $\mu_S = P_X \circ W_S$ are the joint distributions
over the channels $W_0$ and $W_S$, respectively.
Note that in order for (44) to hold, the minimizer in (45) must
be unique.

Proposition 5: For one-sided sets $S$, the linear decoder
induced by the metric $d = \log W_S$ is capacity achieving.
Note that in [5], the same linear decoder is proved to be
capacity achieving for the case where $S$ is convex.

Proposition 6: Convex sets are one-sided and there exist
one-sided sets that are not convex.

Proposition 7: For any set $S$, the decoder maximizing the
score function $G_n = \sup_{W \in S} \log W^n$ is capacity achieving,
but generalized linear only if $S$ is finite.

Proposition 8: For $S = \cup_{k=1}^{K} S_k$, where $\{S_k\}_{k=1}^{K}$ are
one-sided sets, the generalized linear decoder induced by the
metrics $d_k = \log W_{S_k}$, for $1 \le k \le K$, is not capacity
achieving (in general).

The following theorem is the main result of the paper.

Theorem 1: For $S = \cup_{k=1}^{K} S_k$, where $\{S_k\}_{k=1}^{K}$ are
one-sided sets, the generalized linear decoder induced by the
metrics $d_k = \log \frac{W_{S_k}}{(\mu_{S_k})_Y}$, for $1 \le k \le K$, is capacity
achieving.



V. Linear Decoding for Compound Channel: 
The General Case 

A. The Results 

The previous section gives us a series of results regarding
linear decoders on different kinds of compound sets, in the
very noisy setting. While focusing on special channels, the
geometric insights we developed in the previous section are
clearly helpful in understanding the problem in general. In
this section, we will show that indeed most of the results
reported in the previous section have "natural" counterparts in
the general, not very noisy, cases. Moreover, the proofs of these
general results often proceed in a step by step correspondence
with those for the very noisy case. We often refer to such a
procedure of generalizing the results from the very noisy case
to the general cases as "lifting". In the following, we will first
list all the general results, and give proofs in Section V-B.
Recall that the optimal input distribution of a set $S$ is given by

$$P_X = \arg\max_{P \in M_1(\mathcal{X})} \inf_{W \in S} I(P, W),$$

and if the maximizers are not unique, we define $P_X$ to be any
arbitrary maximizer.



B. Proofs: Lifting Local to Global Results 

In this section, we illustrate how the results and proofs
obtained in Section IV in the very noisy setting can be lifted
to results and proofs in the general setting. We first consider
the case of one-sided sets. By revisiting the definitions made
in Section IV-A, we will try to develop a "naturally"
corresponding notion of one-sidedness for the general problems.
By definition of a VN one-sided set, $S$ is such that

$$\|\tilde L_0\|^2 - \|\tilde L_S\|^2 - \|\tilde L_0 - \tilde L_S\|^2 \ge 0, \quad \forall L_0 \in S. \tag{46}$$

Next, we find the divergences, for the general problems, whose
very noisy representations are these norms: recall that

$$D(\mu_0 \| \mu_0^p) \xrightarrow{VN} \|L_0 - \bar L_0\|^2 = \|\tilde L_0\|^2 \tag{47}$$

and

$$D(\mu_S \| \mu_S^p) \xrightarrow{VN} \|L_S - \bar L_S\|^2 = \|\tilde L_S\|^2. \tag{48}$$

On the other hand, we also have

$$D(\mu_0 \| \mu_S) \xrightarrow{VN} \|L_0 - L_S\|^2$$

and

$$D(\mu_0^p \| \mu_S^p) \xrightarrow{VN} \|\bar L_0 - \bar L_S\|^2,$$

and hence

$$D(\mu_0 \| \mu_S) - D(\mu_0^p \| \mu_S^p) \xrightarrow{VN} \|L_0 - L_S\|^2 - \|\bar L_0 - \bar L_S\|^2 = \|\tilde L_0 - \tilde L_S\|^2, \tag{49}$$

where the last equality simply uses the projection principle,
i.e., that the projection of $L$ onto the centered directions, given
by $\tilde L = L - \bar L$, is orthogonal to the projection's height $\bar L$,
implying

$$\|\tilde L\|^2 = \|L\|^2 - \|\bar L\|^2.$$

Now, by reversing the very noisy approximation in (47),
(48) and (49), we get that

$$D(\mu_0 \| \mu_0^p) - D(\mu_S \| \mu_S^p) - \big[D(\mu_0 \| \mu_S) - D(\mu_0^p \| \mu_S^p)\big] \ge 0$$

for all $W_0 \in S$, can be viewed as a "natural" counterpart
of (17), hence of the VN one-sided definition. With a little
simplification, this inequality is equivalent to

$$D(\mu_0 \| \mu_S^p) \ge D(\mu_0 \| \mu_S) + D(\mu_S \| \mu_S^p), \quad \forall W_0 \in S. \tag{50}$$

Therefore, we use this as the definition of the general
one-sided sets, as expressed in Definition 4.

Clearly, as we mechanically generalized the notion of
one-sided sets from a special very noisy case to the general
problem, there is no reason to believe at this point that the
resulting one-sided sets will have the same property in the
general setting as their counterparts in the very noisy case;
namely, that the linear decoder induced from the worst channel
achieves the compound capacity. However, this turns out to
be true, and the proof again follows closely the corresponding
proof of the very noisy special case.

Proof of Proposition 5:
Recall that in the VN case, when the actual channel is $W_{0,\varepsilon}$,
and the decoder uses metric $d_\varepsilon = \log W_{1,\varepsilon}$, the achievable
rate, in terms of the corresponding centered directions $\tilde L_0, \tilde L_1$,
is given by (cf. Section IV-A)

$$\inf_{\tilde L:\ \langle \tilde L, \tilde L_1\rangle \ge \langle \tilde L_0, \tilde L_1\rangle} \|\tilde L\|^2. \tag{51}$$

The constraint of the optimization can be rewritten in norms
as

$$\tilde L:\ \|\tilde L\|^2 - \|\tilde L - \tilde L_1\|^2 \ge \|\tilde L_0\|^2 - \|\tilde L_0 - \tilde L_1\|^2. \tag{52}$$

Now if $L_0$ lies in a one-sided set $S$, and we use as decoding
metric the worst channel $L_1 = L_S$, by using definition
(46), and recognizing that $\|\tilde L - \tilde L_S\|^2$ is non-negative, this
constraint implies

$$\|\tilde L\|^2 \ge \|\tilde L_0\|^2 - \|\tilde L_0 - \tilde L_S\|^2 \ge \|\tilde L_S\|^2, \quad \forall L_0 \in S, \tag{53}$$

from which we conclude that the compound capacity is
achievable. The proof of Proposition 5 replicates these steps
closely.

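First, in the general setting, the mismatched mutual
information is given by

$$R(P_X, W_0, \log W_S) = \inf_{\mu \in A_S} D(\mu \| \mu_0^p), \tag{54}$$

where

$$A_S = \{\mu : \mu_X = P_X,\ \mu_Y = (\mu_0)_Y,\ E_\mu \log W_S \ge E_{\mu_0} \log W_S\}.$$

Since we consider here a linear decoder, i.e. induced by
only one single-letter metric, we can consider equivalently the
ML or MAP metrics. We then work with the MAP metric and
the constraint set is equivalently expressed as:

$$A_S = \Big\{\mu : \mu_X = P_X,\ \mu_Y = (\mu_0)_Y,\ E_\mu \log \frac{W_S}{(\mu_S)_Y} \ge E_{\mu_0} \log \frac{W_S}{(\mu_S)_Y}\Big\}.$$

Expressing the quantities of interest in terms of divergences,
we write

$$\begin{aligned}
E_\mu \log \frac{W_S}{(\mu_S)_Y}
&= E_\mu \log \Big(\frac{\mu}{\mu^p} \cdot \frac{\mu_S}{\mu} \cdot \frac{\mu^p}{\mu_S^p}\Big) \\
&= D(\mu \| \mu^p) - D(\mu \| \mu_S) + D(\mu^p \| \mu_S^p).
\end{aligned}$$

Similarly we have

$$E_{\mu_0} \log \frac{W_S}{(\mu_S)_Y} = D(\mu_0 \| \mu_0^p) - D(\mu_0 \| \mu_S) + D(\mu_0^p \| \mu_S^p).$$

Thus we can rewrite $A_S$ as

$$\begin{aligned}
A_S = \big\{\mu :\ & \mu_X = P_X,\ \mu_Y = (\mu_0)_Y, \\
& D(\mu \| \mu_0^p) - D(\mu \| \mu_S) + D(\mu_0^p \| \mu_S^p) \\
& \quad \ge D(\mu_0 \| \mu_0^p) - D(\mu_0 \| \mu_S) + D(\mu_0^p \| \mu_S^p)\big\}.
\end{aligned} \tag{55}$$

It is worth noticing that this is precisely the lifting of the
norm constraint used in the VN argument above.

Now, in the VN limit, $D(\mu \| \mu_S) - D(\mu^p \| \mu_S^p)$ is given by
$\|L - L_S\|^2 - \|\bar L - \bar L_S\|^2 = \|\tilde L - \tilde L_S\|^2$, which is clearly
positive. Here, we have that

$$D(\mu \| \mu_S) - D(\mu^p \| \mu_S^p) \ge 0$$

is a direct consequence of the log-sum inequality, and with this,
we can write for all $\mu \in A_S$,

$$D(\mu \| \mu_0^p) \ge D(\mu_0 \| \mu_0^p) - D(\mu_0 \| \mu_S) + D(\mu_0^p \| \mu_S^p),$$

which is in turn lower bounded by $D(\mu_S \| \mu_S^p) = I(P_X, W_S)$,
provided that the set $S$ is one-sided, cf. Definition 4 (note that the last lines
are again a lifting of (53)). Thus, the compound capacity is
achieved. ■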
This general proof can indeed be shortened. Here, we 
emphasize the correspondence with the proof for the very 
noisy case, in order to demonstrate the insights one obtains 
by using the local geometric analysis. 
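The divergence decomposition of the MAP metric used in the proof of Proposition 5 can be verified numerically. The sketch below is only an illustration: the input distribution and the two binary channels are hypothetical, and `mu`, `mu_s` stand for the joint distributions $P_X \circ W$ and $P_X \circ W_S$. It checks the identity $E_\mu \log \frac{W_S}{(\mu_S)_Y} = D(\mu\|\mu^p) - D(\mu\|\mu_S) + D(\mu^p\|\mu_S^p)$ and the log-sum consequence $D(\mu\|\mu_S) \ge D(\mu^p\|\mu_S^p)$.

```python
import math

def joint(px, w):
    """Joint distribution mu(a, b) = px[a] * w[a][b]."""
    return [[px[a] * w[a][b] for b in range(len(w[0]))] for a in range(len(px))]

def product_of_marginals(mu):
    """mu^p(a, b) = mu_X(a) * mu_Y(b)."""
    mx = [sum(row) for row in mu]
    my = [sum(mu[a][b] for a in range(len(mu))) for b in range(len(mu[0]))]
    return [[x * y for y in my] for x in mx]

def kl(p, q):
    """Divergence D(p || q) in nats for two joint distributions."""
    return sum(pv * math.log(pv / qv)
               for prow, qrow in zip(p, q)
               for pv, qv in zip(prow, qrow) if pv > 0)

def expected_map_metric(mu, mu_s):
    """E_mu log(W_S / (mu_S)_Y), using W_S / (mu_S)_Y = mu_S / mu_S^p."""
    mu_s_p = product_of_marginals(mu_s)
    return sum(mu[a][b] * math.log(mu_s[a][b] / mu_s_p[a][b])
               for a in range(len(mu)) for b in range(len(mu[0])))

px = [0.5, 0.5]                       # hypothetical input distribution
w = [[0.9, 0.1], [0.2, 0.8]]          # hypothetical channel W
w_s = [[0.7, 0.3], [0.4, 0.6]]        # hypothetical channel W_S
mu, mu_s = joint(px, w), joint(px, w_s)
mu_p, mu_s_p = product_of_marginals(mu), product_of_marginals(mu_s)

lhs = expected_map_metric(mu, mu_s)
rhs = kl(mu, mu_p) - kl(mu, mu_s) + kl(mu_p, mu_s_p)
log_sum_gap = kl(mu, mu_s) - kl(mu_p, mu_s_p)
```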

Proof: of Lemma 6
Let $C$ be a convex set; then for any input distribution $P_X$ the set $D = \{\mu \mid \mu(a,b) = P_X(a)W(b|a),\ W \in C\}$ is a convex set as well. For $\mu$ such that $\mu(a,b) = P_X(a)W(b|a)$, we have

$$D(\mu\|\mu_C^p) = I(P_X, W) + D(\mu_Y\|(\mu_C)_Y),$$

hence we obtain, by definition of $W_C$ being the worst channel of $\mathrm{cl}(C)$,

$$\mu_C = \arg\min_{\mu \in \mathrm{cl}(D)} D(\mu\|\mu_C^p).$$

Therefore, we can use Theorem 3.1 in [3], and for any $\mu_0 \in D$ we have the Pythagorean inequality for convex sets

$$D(\mu_0\|\mu_C^p) \ \ge\ D(\mu_0\|\mu_C) + D(\mu_C\|\mu_C^p). \qquad (56)$$
This concludes the proof of the first claim of the Proposition. 
Now to construct a one-sided set that is not convex, one can 
simply take a convex set and remove one point in the interior, 
to create a "hole". This does not affect the one-sidedness, but 
makes the set non-convex. It also shows that there are sets 
that are one-sided (and not convex) for all input distributions, 
so the one-sidedness does not have to depend on which input 
distribution is chosen. ■ 
Proposition 6 says that our definition of one-sided sets is strictly more general than convex sets. This generalizes the known result [5] on when a linear receiver achieves compound capacity. More importantly, our definition leads to the meaningful use of generalized linear decoders with a finite number of metrics: it is easy to construct an example of a compound set with an infinite number of disconnected convex components, but the notion of finite unions of one-sided sets is general enough to include most compound sets that one may encounter.
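As an illustration of the Pythagorean inequality (56), consider a toy convex compound set that is an assumption of this sketch and not taken from the paper: binary symmetric channels with crossover probability in $[0.1, 0.3]$ and a uniform input. Mixtures of such channels are again binary symmetric, and the worst channel is the one with crossover 0.3. The code evaluates the gap $D(\mu_0\|\mu_C^p) - D(\mu_0\|\mu_C) - D(\mu_C\|\mu_C^p)$ on a grid of channels in the set.

```python
import math

def bsc_joint(p):
    """Joint distribution of a uniform input and a BSC with crossover p."""
    return [[0.5 * (1 - p), 0.5 * p], [0.5 * p, 0.5 * (1 - p)]]

def product_of_marginals(mu):
    """mu^p(a, b) = mu_X(a) * mu_Y(b)."""
    mx = [sum(row) for row in mu]
    my = [sum(mu[a][b] for a in range(len(mu))) for b in range(len(mu[0]))]
    return [[x * y for y in my] for x in mx]

def kl(p, q):
    """Divergence D(p || q) in nats for two joint distributions."""
    return sum(pv * math.log(pv / qv)
               for prow, qrow in zip(p, q)
               for pv, qv in zip(prow, qrow) if pv > 0)

# Worst channel of the toy set: the largest crossover has the smallest
# mutual information on [0.1, 0.3].
mu_c = bsc_joint(0.3)
mu_c_p = product_of_marginals(mu_c)

def pythagorean_gap(p0):
    """D(mu0 || mu_C^p) - D(mu0 || mu_C) - D(mu_C || mu_C^p) for BSC(p0)."""
    mu0 = bsc_joint(p0)
    return kl(mu0, mu_c_p) - kl(mu0, mu_c) - kl(mu_c, mu_c_p)

gaps = [pythagorean_gap(0.1 + 0.02 * i) for i in range(11)]
```

In this example the gap is non-negative on the whole set and vanishes (up to rounding) at the worst channel itself.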

In the next proofs, we no longer give explicitly the analogy 
with the VN setting. 

Proof: of Proposition 
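We need to show the following:

$$\bigwedge_{W_1 \in S}\ \inf_{\mu:\ \mu^p = \mu_0^p,\ E_{\mu} \log W_1 \ge \bigvee_{W \in S} E_{\mu_0} \log W} D(\mu\|\mu^p) \ \ge\ \bigwedge_{W \in S} I(P_X, W),$$

and we will see that the left-hand side of this inequality is equal to $I(P_X, W_0)$. Note that $\bigvee_{W \in S} E_{\mu_0} \log W = E_{\mu_0} \log W_0$. Thus, the desired inequality is equivalent to

$$\forall W_1 \in S, \quad \inf_{\mu:\ \mu^p = \mu_0^p,\ E_{\mu} \log W_1 \ge E_{\mu_0} \log W_0} D(\mu\|\mu^p) \ \ge\ \bigwedge_{W \in S} I(P_X, W). \qquad (57)$$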

Using the marginal constraint $\mu^p = \mu_0^p$, we have

$$E_{\mu} \log W_1 \ge E_{\mu_0} \log W_0 \iff E_{\mu} \log \frac{\mu_1}{\mu^p} \ge E_{\mu_0} \log \frac{\mu_0}{\mu_0^p},$$

and using the fact that $D(\mu\|\mu_1) \ge 0$, we have

$$\inf_{\mu:\ \mu^p = \mu_0^p,\ E_{\mu} \log W_1 \ge E_{\mu_0} \log W_0} D(\mu\|\mu^p) \qquad (58)$$

$$= \inf_{\mu:\ \mu^p = \mu_0^p,\ D(\mu\|\mu^p) - D(\mu\|\mu_1) \ge D(\mu_0\|\mu_0^p)} D(\mu\|\mu^p) \qquad (59)$$

$$\ge D(\mu_0\|\mu_0^p) = I(P_X, W_0). \qquad (60)$$

This concludes the proof of the Proposition. In fact, one could get a tighter lower bound by expressing (58) as

$$E_{\mu} \log W_1 \ge E_{\mu_0} \log W_0 \iff D(\mu\|\mu^p) - \left(D(\mu\|\mu_1) - D(\mu^p\|\mu_1^p)\right) \ge D(\mu_0\|\mu_0^p) + D(\mu_0^p\|\mu_1^p),$$

and using the log-sum inequality to show that $D(\mu\|\mu_1) - D(\mu^p\|\mu_1^p) \ge 0$, (59) is lower bounded by

$$D(\mu_0\|\mu_0^p) + D(\mu_0^p\|\mu_1^p).$$

Figure 7 illustrates this gap. ■

Fig. 7. This figure represents the left-hand side of (57). It represents two cases: when $W_1 = W_0$ and when $W_1$ is an arbitrary channel in $S$. The planes in the figure represent the constraint sets appearing in the optimization for each of these cases. The fact that the twisted plane is not tangent to the divergence ball with radius $D(\mu_0\|\mu_0^p)$ illustrates the gap pointed out in the proof of Proposition 7.

Proof: of Proposition 8

We found a counter-example for the very noisy setting in Section IV-C; therefore the negative statement holds in the general setting. ■

Proof: of Theorem
We need to show

$$\inf_{\mu \in A} D(\mu\|\mu_0^p) \ \ge\ \bigwedge_{k=1}^{K} I(P_X, W_k), \qquad (61)$$

where $A$ contains all joint distributions $\mu$ such that

$$\mu_X = P_X, \quad \mu_Y = (\mu_0)_Y, \qquad (62)$$

$$\bigvee_{k=1}^{K} E_{\mu} \log \frac{W_k}{(\mu_k)_Y} \ \ge\ \bigvee_{k=1}^{K} E_{\mu_0} \log \frac{W_k}{(\mu_k)_Y}. \qquad (63)$$



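We can assume w.l.o.g. that $W_0 \in C_1$. We then have, for any $\mu \in A$,

$$D(\mu\|\mu_0^p) \overset{(A)}{=} D(\mu\|\mu^p) \overset{(B)}{\ge} \bigvee_{k=1}^{K} E_{\mu} \log \frac{W_k}{(\mu_k)_Y} \overset{(C)}{\ge} \bigvee_{k=1}^{K} E_{\mu_0} \log \frac{W_k}{(\mu_k)_Y} \ \ge\ E_{\mu_0} \log \frac{W_1}{(\mu_1)_Y} \overset{(D)}{\ge} E_{\mu_1} \log \frac{W_1}{(\mu_1)_Y} = I(P_X, W_1),$$

where (A) uses (62), (B) uses the log-sum inequality:

$$E_{\mu} \log \frac{W_k}{(\mu_k)_Y} = D(\mu\|\mu^p) + E_{\mu} \log \frac{W_k}{(\mu_k)_Y} - D(\mu\|\mu^p) = D(\mu\|\mu^p) - \underbrace{\left(D(\mu\|\mu_k) - D(\mu^p\|\mu_k^p)\right)}_{\ge 0},$$

(C) is simply (63), and (D) follows from the one-sided property:

$$E_{\mu_0} \log \frac{W_1}{(\mu_1)_Y} - E_{\mu_1} \log \frac{W_1}{(\mu_1)_Y} = D(\mu_0\|\mu_1^p) - D(\mu_0\|\mu_1) - D(\mu_1\|\mu_1^p) \ \ge\ 0.$$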



C. Discussions 

We raised the question of whether it is possible for a decoder to be both linear and capacity achieving on compound channels. We showed that if the compound set is a union of one-sided sets, a generalized linear decoder which is capacity achieving exists. We constructed it as follows: if $W_1, \dots, W_K$ are the worst channels of each component (cf. Figure 8), use the generalized linear decoder induced by the MAP metrics

$$\log \frac{W_1}{(\mu_1)_Y},\ \dots,\ \log \frac{W_K}{(\mu_K)_Y},$$

i.e., decode with

$$G_n(y) = \arg\max_{m}\ \bigvee_{k=1}^{K} E_{\hat{P}(x_m, y)} \log \frac{W_k}{(\mu_k)_Y},$$

where $\mu_k = P_X \circ W_k$, $P_X$ is the optimal input distribution on $S$, and $\hat{P}(x_m, y)$ is the joint empirical distribution of the $m$-th codeword $x_m$ and the received word $y$. We denote this decoder by GMAP($W_1, \dots, W_K$). We also found that using the ML metrics $\log W_1, \dots, \log W_K$ instead of the MAP metrics, i.e., GLRT($W_1, \dots, W_K$), is not capacity achieving.
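The GMAP decoding rule described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the three-word codebook, the uniform input distribution and the two "worst" channels (a BSC with crossover 0.1 and one with crossover 0.9) are all hypothetical. For each codeword the decoder averages each MAP metric over the joint empirical distribution of the codeword and the received word, takes the maximum over metrics, and then the maximum over codewords.

```python
import math

def output_marginal(px, w):
    """Output distribution (P_X o W)_Y."""
    return [sum(px[a] * w[a][b] for a in range(len(px)))
            for b in range(len(w[0]))]

def map_metric(px, w):
    """Single-letter MAP metric log(W(b|a) / (P_X o W)_Y(b))."""
    qy = output_marginal(px, w)
    return [[math.log(w[a][b] / qy[b]) for b in range(len(qy))]
            for a in range(len(px))]

def empirical_score(x, y, metric):
    """Expectation of the metric under the joint empirical distribution of (x, y)."""
    return sum(metric[a][b] for a, b in zip(x, y)) / len(y)

def gmap_decode(codebook, y, px, channels):
    """Index of the codeword maximizing the largest empirical MAP score."""
    metrics = [map_metric(px, w) for w in channels]
    scores = [max(empirical_score(x, y, d) for d in metrics) for x in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)

px = [0.5, 0.5]
# Hypothetical worst channels of two one-sided components.
channels = [[[0.9, 0.1], [0.1, 0.9]],   # mostly preserves the input
            [[0.1, 0.9], [0.9, 0.1]]]   # mostly flips the input
codebook = [[0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1], [1, 1, 0, 0, 1, 0]]

# Codeword 0 sent over a "preserving" channel, with one symbol error.
decoded_a = gmap_decode(codebook, [0, 0, 0, 1, 1, 0], px, channels)
# Codeword 1 sent over a "flipping" channel, with one symbol error.
decoded_b = gmap_decode(codebook, [1, 0, 1, 0, 1, 1], px, channels)
```

The second received word is almost the complement of codeword 1, so only the metric of the flipping channel gives it a high score. A single linear metric could not handle both received words.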

It is instructive to compare our receiver with the MMI receiver. We observe that if the codeword $x_m$ is chosen from a fixed composition $P_X$ code, the empirical mutual information is given by

$$I(\hat{P}(x_m, y)) = \sup_{W} E_{\hat{P}(x_m, y)} \log \frac{W}{(P_X \circ W)_Y}, \qquad (64)$$

where the maximization is taken over all possible DMCs $W$, which means that the MMI decoder is actually the GMAP decoder taking into account all DMCs. Our result says that we do not need to enumerate all DMC metrics to achieve capacity: for a given compound set $S$, we can restrict ourselves to a carefully selected subset of all metrics and yet achieve the compound capacity. Those important metrics are found by extracting the one-sided components of $S$ and taking the MAP metrics induced by the worst channels of these components. When $S$ has a finite number of one-sided components, this decoder is generalized linear. The key step is to understand the structure of the space of decoding metrics. The geometric insights give rise to a notion of which channels are dominated by which (with the one-sided property) and of how to combine the dominant representatives of each component (generalized MAP metrics).

Fig. 8. GMAP with worst channels algorithm: here $S$ is represented by the union of all sets appearing in the figure. In this set, however, there are only three one-sided components, with respective worst channels $W_1$, $W_2$ and $W_3$; hence, decoding with the generalized linear decoder induced by the three corresponding MAP metrics is capacity achieving. MMI instead would have required an optimization over infinitely many metrics, given by all possible DMCs.

We argued that the family of sets that can be written as finite unions of one-sided sets covers a large variety of sets, even larger than the family of finite unions of convex sets. This means that generalized linear decoders with finitely many metrics can be found to achieve capacity for a large family of compound sets. Yet, there do exist compound sets that are not even a finite union of one-sided components. To see this, we can go back to the local geometric picture and imagine a compound set with infinitely many worst channels, for which the procedure shown in Figure 8 has to go through an infinite number of steps. We argue, however, that such examples are pedagogical, in the sense that if one is willing to give up a small fraction of the capacity, then a finite collection of linear decoding metrics suffices. Moreover, there is a graceful tradeoff between the number of metrics used and the loss in achievable rate.

Even more interestingly, one can develop a notion of a "blind" generalized linear decoder, which does not even require the knowledge of the compound set, yet is guaranteed to achieve a fraction of the compound capacity. We describe here such decoders in the VN setting. As illustrated in Figure 9, such decoders are induced by a set of metrics chosen in a "uniform" fashion. For a given compound set, we can then grow a polytope whose faces are the hyperplanes orthogonal to these metrics, and there will be a largest such polytope that contains the entire compound set in its complement. This determines the rate that can be achieved with such a decoder on a given compound set, cf. $C_{\mathrm{poly}}$ in Figure 9. In general, $C_{\mathrm{poly}}$ is strictly less than the compound capacity, denoted by $C$ in Figure 9; the only case where $C = C_{\mathrm{poly}}$ is when, by luck, one of the uniform directions is along the worst channel (and there are enough metrics to contain the whole compound set). Now, for a number $K$ of metrics, no matter what the compound set looks like, and no matter what its capacity is, the ratio between $C_{\mathrm{poly}}$ and $C$ can be estimated: in the VN geometry, this is equivalent to picking a sphere with radius $C$ and computing the ratio between $C$ and the "inner radius" of a $K$-polytope inscribed in the sphere. It is also clear that the higher the number of metrics is, the closer $C_{\mathrm{poly}}$ is to $C$, and this controls the tradeoff between the computational complexity and the achievable rate. Again, as suggested by the very noisy picture, there is a graceful tradeoff between the number of metrics used and the loss in achievable rate.

Fig. 9. A "blind" generalized linear decoder for VN 3-ary compound channels, with 3 metrics chosen uniformly. The hexagon drawn in the figure is the largest hexagon defined by those uniform metrics that contains the compound set in its complement. This gives the achievable rate with such a decoder, namely $C_{\mathrm{poly}}$ in the figure, whereas the compound capacity is given by the minimum squared norm in the set, i.e., $C$ in the figure.
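The inner-radius computation mentioned above can be illustrated in the simplest case. The sketch below assumes a two-dimensional picture with $K$ unit-norm metric directions evenly spaced on the circle; these assumptions and the function names are not from the paper. It computes, over all unit vectors, the worst-case largest projection onto one of the $K$ directions, which is the inner radius of a regular $K$-gon inscribed in the unit circle, namely $\cos(\pi/K)$. How this radius ratio translates into a rate ratio depends on the normalization of $C$ and is not asserted here.

```python
import math

def uniform_directions(k):
    """K unit vectors evenly spaced on the circle."""
    return [(math.cos(2 * math.pi * j / k), math.sin(2 * math.pi * j / k))
            for j in range(k)]

def best_projection(point, directions):
    """Largest inner product between the point and any of the directions."""
    return max(point[0] * u[0] + point[1] * u[1] for u in directions)

def worst_case_inner_radius(k, samples=3600):
    """Minimum over sampled unit vectors of the largest projection.

    The sampling grid contains the angle pi/k whenever k divides
    samples / 2, so the worst case is attained on the grid.
    """
    dirs = uniform_directions(k)
    return min(best_projection((math.cos(t), math.sin(t)), dirs)
               for t in (2 * math.pi * i / samples for i in range(samples)))

# Inner radius for a growing number of metrics.
radii = {k: worst_case_inner_radius(k) for k in (3, 4, 6, 12)}
```

In this toy geometry the inner radius increases toward 1 as $K$ grows, which matches the tradeoff between the number of metrics and the loss in rate described in the text.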

VI. Conclusion 

Many information-theoretic problems evaluate the limiting performance of a communication scheme by an expression that optimizes divergences over constrained probability distributions. The divergence is not a formal distance. However, when the distributions are close to each other, which is the case when the channels are very noisy, we are able to make local computations, and the divergence can be approximated by a squared norm. We showed that the geometry governing this local setting is that of an inner product space, where notions of angles and distances are well defined. This geometric insight greatly simplifies the problems. Rather than providing a good approximation per se, it provides a simplified problem, for which we have better insight and which points to solutions of the original problem. It is also a powerful tool for finding counterexamples. Finally, we showed how, in this problem, results proven locally could be "lifted" to results proven globally.
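The local approximation of the divergence by a squared norm can be observed numerically. The following sketch uses a hypothetical reference distribution `q` and a perturbation direction `direction` with zero mean under `q`; both are illustrative choices. It compares $D(p\|q)$ with the weighted squared norm $\frac{1}{2}\sum_b (p(b) - q(b))^2 / q(b)$ for $p = q(1 + \varepsilon L)$ as $\varepsilon$ shrinks.

```python
import math

def kl(p, q):
    """Divergence D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def local_sq_norm(p, q):
    """Half the weighted squared norm: (1/2) * sum (p - q)^2 / q."""
    return 0.5 * sum((a - b) ** 2 / b for a, b in zip(p, q))

q = [0.5, 0.3, 0.2]              # hypothetical reference distribution
direction = [1.0, -1.0, -1.0]    # satisfies sum_b q(b) * L(b) = 0

ratios = {}
for eps in (0.1, 0.01, 0.001):
    # Perturbed distribution p = q * (1 + eps * L); it still sums to one.
    p = [b * (1 + eps * d) for b, d in zip(q, direction)]
    ratios[eps] = kl(p, q) / local_sq_norm(p, q)
```

The ratio between the two quantities approaches 1 as the perturbation shrinks, which is the sense in which the divergence behaves like a squared norm in the very noisy regime.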